the back of your neck. Come down to your shoulders and your upper body. Notice how you're breathing, and don't try to control it now. Just let it flow. Come down into your back, and if you have any pain, any tension, any stress that you're harboring, just watch it. Don't try to control it. Just watch it. Come down towards your buttocks, your groin. All sensations are valid: itches are fine, pains are fine, twitches are fine. Now your thighs, your knees, your calves. Feel your feet; feel the ground through your shoes or your slippers. Feel like you're firmly planted on the ground, your weight not on one particular leg but spread across both feet. Feel what it is to be balanced. One announcement: people in red lanyards have requested not to be photographed, so please avoid photographing people in red lanyards. The other piece of information: if you open your schedule for today, there is an OTR session listed at the bottom, which is an off-the-record session on learning data science. It is wrongly listed as being in room 01 due to a misprint; it will be happening in the banquet hall, which is this one. So where you see an empty session listed for the banquet hall, that is the OTR session. We will start with the presentation: Deva P. Seetharam from DataGlen will be talking about Bits and Joules: data-driven energy systems. Hello, check. Morning folks. Is it too loud? Can we adjust the volume a bit? Thanks to the yoga instructor for giving us a great start; I really enjoyed it. So today I'm going to be talking about Bits and Joules: data-driven energy systems. When I was young, this is how family entertainment used to be: there would be a black and white TV, and you would just go sit in the hall and watch a TV program together. There used to be a thing called Chitrahaar or something, I forget the exact name, where movie songs would play from 8 to 8:30 PM. We would wait for it, thinking in school about how we were going to watch it, et cetera. Now the media has completely changed. There are so many options: your cell phones, YouTube, iPad, Facebook, what not. The media has moved from a centralized paradigm, somebody generating content and sending it to people who passively consume it, to something more like Facebook, where anybody can become a blogger. Does anyone know who this is? This lady's name is Lilly Singh. She is from Canada and she's a YouTube star. She has made more than $7.5 million producing content for YouTube, and she has millions of subscribers on her channel. Think about the days when there was just Doordarshan and everything was central, versus now, when anyone can be a content producer. The same thing is happening in the power grid: from centralized electricity generation to something completely decentralized and more democratized. So I'm going to talk about how IoT and data technologies are enabling this kind of transition. In a conventional power grid, the energy gets generated somewhere far away, flows through the transmission network, comes to the distribution network, and people passively consume it in their homes or office buildings. So it's centralized energy generation, and supply always follows the demand. And since this is a HasGeek conference, let me use a programming analogy.
You must have all seen this: when you wrote your first C program, you would have a static array of 1000 elements. Your program could never take more than 1000 elements; it's fixed, and if you have more things, you cannot put them into the array. Whereas now the energy grid has become like malloc in C: the consumer determines how much energy they want, and they work with the generators to adjust their demand so that the peak does not grow too much. And finally, there was no involvement of consumers in the older form of grids. In the newer grids they are called prosumers: they can generate energy and put it into the grid, they can reduce their energy consumption, shift their energy consumption, and so on. So why is such a paradigm shift required? There are three reasons. First, there is so much energy shortage: currently more than one billion people across the world don't have access to an electric grid. Even in India, about 200 million people don't have access to the electric grid. Second, look at the power consumption pattern; this is from New York. On a daily basis it can fluctuate anywhere between 5,000 and 30,000 megawatts. The variation in consumption is huge, and that is a big problem in the grid, because generation and consumption have to be matched instantaneously, so you have to have the capacity to handle all the fluctuations. And finally, the environmental concerns: more than 70% of the world's greenhouse gas emissions come from energy generation and energy consumption activities. Given these three important reasons, people are talking about decentralized grids, and this is already happening. There are four main aspects of the decentralized grid. One is energy efficiency: how do you save energy, using anything from LED lights to more energy-efficient refrigerators and air conditioners? In fact, you must have heard that even in data centers they have something called PUE, power usage effectiveness: basically the total energy the facility consumes versus the energy that actually goes into computing. They want to reduce the share that goes into cooling (a toy calculation follows below). I've been to a data center in Sweden; Facebook has set up a data center in the northern part of Sweden because it's so close to the Arctic and the air is so cold that you don't need any separate cooling or air conditioning there, so they save energy on that. Google pays several tens of millions of dollars just in electricity bills. Then there is load shifting. This is something we do in Bangalore every day to get away from the traffic. Zainab yesterday told me, don't come to this area after eight o'clock, the traffic will be so bad; cross the KR Puram bridge before 7:30. That is load shifting: move away from the peak hours. Then, more distributed generation: you must have seen that even in Bangalore there are a lot of places with rooftop solar, solar water heaters, etc. And finally, energy injection: you can generate energy and put it into the grid. This is what I was comparing with Facebook and Twitter: anybody can become a generator of energy, like how anybody can become a content creator now.
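To make the PUE figure concrete, here is a minimal sketch with made-up numbers; this is an illustration, not data from any real facility:

```python
# PUE (power usage effectiveness) = total facility energy / IT equipment energy.
# The numbers below are hypothetical, purely for illustration.
it_energy_kwh = 1000.0   # energy that went into actual computing
overhead_kwh = 500.0     # cooling and other facility overheads

pue = (it_energy_kwh + overhead_kwh) / it_energy_kwh
print(f"PUE = {pue:.2f}")  # 1.50: half a unit of overhead per unit of compute
```

A PUE close to 1.0 means almost all the energy goes into computing, which is exactly why siting a data center in naturally cold air helps.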
So I'm going to talk about two things: one is distributed generation, and the second is load shifting. This topic is so huge and so vast that it would take hours and days to cover it, so I'll just give two examples of how data-driven systems are enabling it. The first one is solar plants. Solar is a very beautiful energy source. First of all, it is clean, and it doesn't have any moving parts. Since we are all software people: it's extremely modular and linearly scalable. You can have a one-watt or 0.5-watt solar panel, like the solar-powered calculators you must have seen, all the way up to 50-megawatt or 100-megawatt solar plants. In fact, in Pavagada, which is about a four-hour drive from here, the government is setting up about two gigawatts of solar plants. So it's extremely scalable. It's also quiet: wind turbines make a lot of noise and disturb the birds, whereas solar is quiet. And here is what has happened: when you go to places in China, the pollution is so bad, there's so much smog, that people celebrate if they see a blue sky, as if it's Diwali or Christmas. They celebrate: oh, today is clean. So the Chinese government put out a mandate that they have to go towards solar. They gave a lot of subsidies to solar manufacturers, and solar panels have become so cheap and so efficient that now the Indian panel manufacturers are pushing the government to take protectionist measures. They are asking the government to put extra duty on the Chinese panel manufacturers; they don't want the panels to come in. I think that is pathetic. We have to improve our engineering quality instead of asking the government to put extra duty on the competitors so that you don't have competition. What the Chinese have done is a great thing for the world: it reduces pollution and gives us more energy. Now, managing solar plants is quite complicated, because one megawatt of plant takes about five acres of land. Usually they are in very remote places, because that's where land prices are low; they usually take the fallow lands, and it's extremely hot. There are no restrooms, nothing. I've been to these plants to set up our systems; you have to schedule your restroom visits, because the nearest one is four or five kilometers away. That's how bad it is. So it's very hard to find site engineers to come and work there and manage these plants. And there are a lot of things that can go wrong: bird droppings can land on the panels, circuit breakers can burn out because of the extreme heat; you can imagine all the problems that can happen in such an area. So what we asked is: how do you take data from these kinds of environments and help people manage their plants efficiently, so that you get the maximum amount of energy out of them? I'm one of the co-founders of a company called DataGlen, which is focused on solar plant management, or renewable energy management in general. This is the overall architecture. We go to a solar plant and we put in a data logger; this is the point where the IoT technology starts. We collect data from the weather stations, and we collect data from the energy meters and inverters.
The inverters are the ones that take the solar energy, which is in DC form, convert it into AC, and put it into the grid. When you look at solar plant construction, the solar panels are connected in series, like a string of pearls; that is called a string of panels. Those strings are connected in parallel to each other, in what is called an array of solar panels, and that goes to an SMU (string monitoring unit) and a repeater and then the inverter, and from there into the grid. So we collect data from all these components, from the strings, from the inverters and so on, and send the data to our cloud over HTTP REST or MQTT. We also collect data from other services, like weather forecasting services, and then do various analytics and provide interfaces like solar dashboards and smartphone apps. We do various analytics for improving operation and maintenance efficiency, and we give real-time alerts to site engineers so that they can take corrective actions. This is our solar dashboard. I will not go into its details, but it gives you a very quick overview. There are companies like Greenko and Waaree that have multiple plants; they want to see what is happening at their plants in different parts of India. They can come to this dashboard and see what is happening: how much energy is being generated, how many plants are down, and so on. I want to talk a little bit about the various analyses that we do. One is spatio-temporal analysis. As I said, solar plants are huge and all the equipment is distributed over vast spaces. So we ask: how is each string generating? How does it compare with the other strings? How does the performance compare with the other panels? Is there any substantial reduction in output? Are there variations in the output? One of the things we found this way was that one particular string was producing less at around three o'clock in the afternoon and again around 9:30 in the morning, every day, consistently, for the first six months of the year. What we found out is that there is a handrail, because it's a roof-mounted solar plant, a handrail for going up and cleaning the panels. That handrail was casting a shadow in the morning on one side of the panels and in the evening on the other side. That shadow was causing heating, and the panels were getting damaged. Using the spatio-temporal analysis we could not say definitively that it was because of the handrail, but we could localize the problem: the site engineer could go and inspect that particular point. Otherwise they would never have caught what was happening until they saw the data. We also look at different root causes. As I said, in a solar plant there are various factors that affect the performance of the plant: the amount of sunlight falling there, the amount of humidity, whether the modules are getting too hot or the ambient heat is too much, and so on. So we do a correlation analysis to find out which factors are affecting the generation. That is what we do on that side. The next part is work I did when I was in energy research at IBM; I founded and led energy research for IBM in India.
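Here is a minimal sketch of the kind of spatio-temporal and correlation analysis described above, written in pandas. The file name and all column names are hypothetical; this illustrates the idea, not DataGlen's actual pipeline:

```python
import pandas as pd

# Hypothetical per-string telemetry, one row per (timestamp, string).
# Assumed columns: timestamp, string_id, power_kw, irradiance_wm2,
# module_temp_c, ambient_temp_c, humidity_pct.
df = pd.read_csv("string_telemetry.csv", parse_dates=["timestamp"])

# Spatio-temporal view: average output of each string by hour of day.
hourly = (df.assign(hour=df["timestamp"].dt.hour)
            .pivot_table(index="hour", columns="string_id",
                         values="power_kw", aggfunc="mean"))

# A string that consistently underperforms its siblings at specific hours
# (say, 9:30 and 15:00, like the handrail shadow) shows up as a dip here.
deviation = hourly.sub(hourly.mean(axis=1), axis=0)
print("Deepest hourly dips per string:")
print(deviation.min().sort_values().head())

# Root-cause hints: correlate generation with environmental factors.
factors = ["irradiance_wm2", "module_temp_c", "ambient_temp_c", "humidity_pct"]
print(df[["power_kw"] + factors].corr()["power_kw"])
```

A string whose output dips at the same hours every day, while its siblings do not, is exactly the handrail-shadow pattern described above.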
So when you look at it, there are a lot of fluctuations in the demand, as I already told you. One of the problems this creates is: how do you maintain equipment to handle the peak loads? When you see this data from 1980 in New England, the top slice of capacity is like having a six-lane freeway where one lane gets used for only one or two hours a year, yet you have to put in enough money to construct the whole freeway. And this is getting worse and worse, because the peak-to-average ratio is increasing like crazy; that is what has happened over these 36 years. Many of the network operators, especially the utilities in India, are already operating at a loss; they are not able to maintain such infrastructure. Also, electricity gets traded in real time. You can look it up: there is something called the India Energy Exchange, where producers and consumers trade energy. So the market prices also fluctuate like crazy, and the wind speed fluctuates, so your generation fluctuates. All these factors mean you need better peak-load alleviation mechanisms. In conventional power balancing, there are plants that cater to the base load, the part that is always needed, like the one lane of the highway that is always in use. Those are very expensive to construct and very cheap to operate. The peaker plants are the opposite: very cheap to construct and very expensive to operate. That is what this curve is showing. If I can reduce the peak demand down to here, I can save this much money in serving the peak load. So what have the historic approaches been? This is one of the reasons daylight saving time came up: during the Indo-China war they said, we are going to shift the clock a little so you can use more of the sunlight. So 6 a.m. became 5 a.m., so that you burn less electricity. In Tokyo, they could not serve all the demand, so they started doing brownouts: instead of 230 volts, you supply the energy at a slightly lower voltage, like 220 volts. In California, they started doing rolling blackouts during the Enron scandal; they had so much energy shortage that they would say, this area, say the Hebbal area, is going to have a blackout for the next two hours, and so on. This is very familiar to us. When I grew up in Tamil Nadu, we used to have eight-hour blackouts, six-hour blackouts, and I donated so much blood to mosquitoes. People say donate blood; well, I donated blood and saved wildlife. So how do you deal with this? There is something called demand response. As I said, this is something we do in Bangalore every day: you try to avoid the peak traffic. Similarly, you try to avoid the peak load. Let's say this is how the demand looks. If I can shift some load to before the peak and some load to after it, the peak-to-average ratio comes down. That is the essential concept of demand response. And there are a lot of loads that can be peak-shifted. (I don't know why the cursor is moving strangely here.) Anyway: there is the washer and dryer; you don't have to run them at peak time.
If you have power storage, what can you do with it? Your rice cooker; and your water heater is very important. Every water heater consumes about two kilowatts, and everyone starts their water heater at the same time before going to work, etc., so that increases the peak load. When you look at Bangalore, the power consumption has a bimodal distribution: the morning, when people get ready for work, and the evening, when they come back and start using various equipment. So you see two huge peaks. In the western countries, especially Europe and the US, they said: we'll create a demand response program. We will send a message to our consumers saying, reduce your consumption at this point, or increase your consumption at this point. The problem with that approach in India is that about 800 million people have access to the electric grid, but only about 150 million people have access to the Internet. And even when people have smartphones, they often don't turn on cellular data connectivity; they want to save money as much as possible. So we asked: how can we create a tool that can do this demand response without depending on the Internet? That is what we called nPlug. I'm sure most of you are familiar with Unix. Unix has the nice command, which reduces the priority of your job; nPlug is like nice for your plug loads. What you do is take your appliance and plug it into this device, and plug the device into the wall socket. The device will sense the condition of the grid; I'll go into the details in a moment. It costs about $20 in large volumes, and it's very simple to install and operate, as simple as installing a power strip (even though that was a little complicated today). Now, when you look at the voltage levels in your grid, you see fluctuation. We measured the voltage level in Bangalore for seven days; each color here is a different day, and this is how it fluctuates throughout the day. Voltage actually indicates the load level on the part of the grid closest to your home, at the transformer or measuring point nearest you, and frequency indicates the imbalance between generation and load. Those are the two conditions. Using these two indicators, without any signal from the utility, the device can decide whether or not to shift your load: this is a peak hour, so I will shift my load; this is an off-peak hour, so I can consume right now. Essentially, this is the algorithm. If the current voltage is less than the lower threshold, you don't consume. If the current voltage is greater than the upper threshold, the grid is lightly loaded, so you consume. Otherwise, you consume with a probability proportional to where the voltage sits in the band: (Vc - VL) / (VU - VL). It works beautifully, and this particular work won an MIT TR35 award for my colleague who led it; her name is Tanuja.
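As a minimal sketch of the decision rule just described, here it is in Python. The threshold values are hypothetical, and the published nPlug work involves more machinery (frequency sensing, appliance scheduling); treat this as an illustration of the voltage rule only:

```python
import random

V_LOWER = 215.0  # hypothetical lower voltage threshold (grid heavily loaded)
V_UPPER = 235.0  # hypothetical upper voltage threshold (grid lightly loaded)

def should_run_now(v_current: float) -> bool:
    """Decide locally, with no signal from the utility, whether a deferrable
    appliance should switch on, based only on the sensed mains voltage."""
    if v_current <= V_LOWER:   # sagging voltage: peak hour, defer the load
        return False
    if v_current >= V_UPPER:   # healthy voltage: off-peak, consume now
        return True
    # In between, consume with probability proportional to where the voltage
    # sits in the band: (Vc - VL) / (VU - VL), as described in the talk.
    p = (v_current - V_LOWER) / (V_UPPER - V_LOWER)
    return random.random() < p

print(should_run_now(218.0))  # usually False: near the heavily loaded end
print(should_run_now(233.0))  # usually True: near the lightly loaded end
```

The probabilistic middle band matters: if every device switched on at exactly the same threshold, they would all come back together and simply move the peak, which is the rebound effect mentioned below.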
The TR35 award is given to 35 innovators under the age of 35; past winners include the Google founders and the Tesla founder. This work won because it's extremely powerful for grid stability, and, as you can see, it involves IoT and sensing, and the analytics is local; it's edge computing. You're not transferring data to some other place to do the analytics. At this point I want to digress and describe what we have been thinking about: hierarchical analytic systems. When you look at the architecture of the brain, there is the lowest part, what people call the reptilian brain, which mostly does stimulus and response. When you touch something hot, you immediately take your hand away. That is very quick, but it cannot do anything beyond simple processing. Then the middle part of the brain does a little bit of language and so on, and the slowest part, the neocortex, can do things like future planning: you can plan a strategy, how to go and hunt an animal, that kind of thing. We propose the same kind of hierarchy for energy systems: for low latency, and to avoid depending on the network, do computing and analytics at the edge; do some at the substation or transformer level; and at the highest level, the cloud, you get data from all the places and do computing that is slow but can be much more sophisticated. In simulation, we showed how effective nPlug can be; as you can see, the two peaks here get completely spread out over multiple hours. There is a concept called the rebound effect: if everybody shifts from the peak hour to the same off-peak hour, you create a new peak. That is another detailed topic which I won't discuss, but if you are doing demand response, you need to look into it. Now, why do all this? There is a paper you can search for, by Richard Smalley, a Nobel Prize winner. He took the ten most important problems we have, like water, terrorism, poverty, and he argues that energy is the most important: if we can get access to clean and cheap energy, every other problem can be solved. To give you an example: there is water scarcity, but if we find very cheap and clean energy for desalinating seawater, of which we have plenty, we can solve the water scarcity problem. He goes through all the other nine problems and shows how each can be addressed this way. So the Indian government is pushing for 100 gigawatts of solar, 60 gigawatts of wind, and so on, basically 175 gigawatts from clean energy sources by 2022. The government has also said that by 2030 all consumer transportation will be electric vehicles. They moved Ashok Jhunjhunwala, a professor from IIT Madras, to the central government to lead this effort. So there will be no internal combustion engines, no fossil fuels, no pollution.
And storage costs are falling. You must have heard about the Tesla guy, Elon Musk, who started something called the Gigafactory in Nevada; lithium-ion battery prices are coming down really drastically. For instance, the government is working on shifting the Andaman and Nicobar Islands, where the only source of electricity is diesel, completely to solar plus storage: whatever energy gets generated during the day is stored and used in the evening. And there are the concerns about climate change and the Paris Accord, with Trump cancelling the US commitments to it. But the good thing is that 40 of the 50 US states have committed to the Paris Accord and say they are going to stick to the same levels of clean energy, pollution control, environmental protection, etc. So a lot of interesting things are happening. Now, Zainab told me the key thing is that I need to give a few takeaway items; I can talk about academic and theoretical things all day, but what is there for a practitioner? So I want to recommend a few books. I'm a dilettante in machine learning and data science; I'm more of an embedded guy. This book from the UK Open University on time series analysis I found very useful and very easy to understand. And this is a very nice book by the Cambridge University physicist David MacKay, Sustainable Energy - Without the Hot Air. It is available completely for free, and he is also a data scientist of sorts: he looks at data brilliantly and comes up with very interesting insights. I would highly recommend looking up his site. There are public data sets for you to look at: the REDD data set, where they have taken data from multiple homes at minute-level granularity; NIWE, the National Institute of Wind Energy, which has published a lot of data; and the Indian Meteorological Department, which publishes a lot of weather data. You can look at all of these and do analytics on them. And finally, if you want to do anything yourself: you can buy energy meters to find out what your consumption is, collect data from your home or your research lab, do various kinds of analysis, and build your own equipment. Nowadays micro-inverters are coming up; maybe you can set up solar on your home and try it out. And finally, we are hiring data engineers and data scientists, and we also take interns, so if anyone is interested, please contact me offline. Thank you. We'll take questions. Regarding the problem of solving the peak-hours load: do you think dynamic pricing of consumption will solve that problem? Yes and no. Dynamic pricing already exists, but there are two issues. One is that there are government regulations: they don't want to expose consumers to the fluctuations of the real-time price markets. The second is that by the time you take a decision based on the price, the condition could already have changed, because, as I said, things are changing at the speed of light. So instead you can have time-of-use pricing, which is already implemented. For instance, in Tamil Nadu, during the peak hours between 6 p.m. and 10 p.m., the electricity price is one rupee per kilowatt-hour more.
That is, you pay roughly 20% more during that window, so that people shift their usage. But you still don't get this kind of smooth peak reduction. So there are companies, for example one called EnerNOC, and what they do is talk to industries and say: if we send you a signal and you reduce your energy consumption at that time, we will give you financial incentives. That kind of mechanism also exists. There are various instruments: peak load reduction credits, CPP, which is critical peak pricing, and so on. Hi, thanks. I have two questions. One is that a lot of these problems, in terms of load balancing or load generation, have existed for decades... I'm not able to hear you well. Sorry, can you hear me now? Yeah. I think a lot of the use cases you discussed have existed in energy networks for decades. In the past, I imagine a lot of it was done with regression modeling. What modeling, sorry? Regression modeling, I suspect. How has machine learning changed the ability to predict and to deliver some of these use cases in the last 15 years? That's question one. The second is: when you look at the value of use cases at the edge, or at the substations, or things that you deliver from central cloud processing, how much value does each of those stages typically give, in terms of energy efficiency or better management of the load? Is it the edge that ends up giving most of the value, or the stuff you do at the center? Okay, so the first question is about how machine learning improved load balancing, is that right? So what has happened is, first of all, they used to predict only what the load in an area was going to be. Now the generation is also unpredictable, because of the fluctuation in solar and so on. Second, let's say you want to participate in a demand response program. We even published a paper on user-sensitive appliance scheduling. Say you are in a commercial building and you want to participate in a demand response program; you want to find out when people are going to be using particular areas. For instance, you know there is going to be a power cut between 11 a.m. and noon every day, and you are going to be using a diesel generator for cooling during that time. When you use a diesel generator, it costs about 20 rupees per kilowatt-hour, because you get only about three units of energy per litre of diesel, so each unit comes to about 20 rupees. So what the building manager can do is pre-cool the building a little beforehand: I know this area is usually more occupied during the power-cut hours, so I'm going to predict the occupancy, pre-cool that area, and reduce my energy cost. That is one example. The second is on the consumer side: when I want to schedule my appliance usage, I don't want to inconvenience myself just to participate in a demand response program. And when I say myself, it can also be a building manager who is participating on behalf of all the consumers or occupants in that area. By doing machine learning,
he can learn the preferences and then do the scheduling accordingly. So those are the two examples I could come up with. And your second question was... can you remind me again? Yeah. So, it depends on the level. There is a concept called virtual power plants, or demand aggregation. Let's say there are five buildings right next to each other, and one building wants to consume a little extra energy; or think of it as five data centers in the Silicon Valley area, and suddenly one data center has more load. They can say: I will balance it with the data center next door. That is substation-level load aggregation, a virtual power plant kind of thing, done at the substation level. The edge level is what I already explained. And the final level is the cloud: deciding how much generation to do for the entire area, or what the price or incentive is going to be for the entire community; that I do at the cloud level. But I don't know the exact answer to your question, because the objectives at each level are a little different. It's not only about energy efficiency; each participant has a different objective. At the lowest level, the building says: I want to operate my local network safely, optimally, and economically. The utility, on the other hand, tries to minimize the amount of incentives it has to give to customers. And when two customers cooperate and reduce demand together, it becomes a statistical multiplexing kind of thing: in total we have this much energy, and I will share some of my capacity with you. So I am not able to measure it purely in terms of energy efficiency. And you cannot rely on the grid frequency either. That's why, when you bring a radio clock from the US and put it here, one that keeps time by counting the 60 hertz in the US or the 50 hertz in India, the clock starts drifting, because our frequency fluctuates so much. It cannot assume that every 50th cycle marks one second, because that doesn't hold. So it's very hard to do that. Last question. Nice talk. One solution which I saw when I was in the UK recently is that first they pay for the... I'm not able to hear you. One solution which could be possible is that in the UK they pay for the electricity first and then they consume it. So they can actually monitor how much of the money they've paid is remaining, and depending on that, they can decide how to use the electrical appliances in their house. Would that kind of solution be helpful in estimating the demand and being much more deterministic? Okay, I was not able to hear you well, but I think I understand: what you're describing is prepaid electricity. There are indeed prepaid electricity meters in some places; in South Africa they have them, and I heard that even in Maharashtra some places give you a prepaid electricity meter. So yes, definitely, using economics to control electricity consumption is very good. When I was growing up, if I left the light on when leaving a room, my father would come and give me a slap.
So I turn the light off before I leave. But that works because the person using the electricity is also the one paying the bill. If you are working in a commercial building, it's different. I used to work in Manyata Tech Park; people would just walk out of a conference room and forget to turn the AC or the lights off, because they are not the ones paying for the consumption. And the second thing is, you often don't even know where the wastage is. We did another piece of work where we found anomalies in appliances: there was a refrigerator that was continuously consuming energy and never going into idle mode, because the gasket around the refrigerator door was broken. Hot air kept going in, and the refrigerator kept cooling it, cooling it, cooling it. There is so much energy wastage of that kind, and those are things that cannot be solved just by economic means. Yeah. Thanks, Deva, for this presentation. Next on stage we would like to call Bharath from Sensara, who is going to tell you about what's on TV. Also, a request to all members of the audience: please switch off your phones or keep them on silent during the presentation; it's slightly disrespectful to the speaker if your phone rings in between. Am I audible? Well, happy to be here. I'm Bharath. We are this company called Sensara, and we are trying to reimagine how television should work, keeping data science at the center of it. I'm sure you'll have a lot of questions; they'll span from ethics to privacy to what's possible and what's not, to whether people even watch TV anymore. I'll be around to take a lot of these questions. The focus of this talk is much more on the data science part of it, more specifically the information retrieval part; we're trying to unravel television information retrieval here. How many of you watch television? And how many of you would say: I don't care about TV, I stopped watching some time back? What were the reasons you stopped watching? Advertisements. Ads. In fact, in our company we watch a lot of TV. We do not watch TV at home, but at work we all watch TV, and guess what we watch? We watch ad breaks. You'll see why. I'm going to talk mostly about our experiences building this product called adbreaks.in. I've done a lot of work in information retrieval, dabbled in search for about 15 years, worked at Google, got a PhD along the way from IISc, but this is the best piece of work we've done. There are also Abhay and Elvis here, who contributed a lot to this work. We wanted to document and share our learnings, whatever we've done; that's the effort here. You'll hear a lot about adbreaks.in. In terms of terminology, we use a lot of frequent sequence mining. There were architecture decisions we had to make along the way, and there's the interesting question of how you even measure such a system: how do you know how good or bad it is? As you watch all this, you will start thinking: what if there were TV ad blockers? That will be the first question that comes up. Privacy, obviously, is a big one. You'll start thinking about the future of TV, and, at large, about this notion of TV information retrieval.
I spent a lot of time on search, but the nature of search is so different in television, because these are real live streams that keep moving. In fact, I could see some parallels with the previous talk, because he's also looking at real-time data; TV is very much real time as well. Very quickly, to set the ground: I'm mostly going to talk about linear TV. Linear TV is the TV you've always known, the TV that comes through your set-top boxes: Tata Sky, Airtel, maybe Hathway. Linear TV is like a set of rivers flowing. There is streaming content, and somebody has chosen what is going to come at each point in time. You can just dip in and drink from a stream, but you cannot stop it; you can't stop a river for long. Of course there are DVR systems, but that's not what we're going to focus on. The opposite of this is nonlinear TV, which is the Netflix model. The way I look at Netflix, it's basically the natural evolution of the DVD player: instead of having the DVD disc at home and running to a library to get it, you're just running to the internet, downloading it, and watching. We are not going to focus on nonlinear TV, because nonlinear TV is a simple, easy problem; linear TV presents much harder problems to solve. And linear TV is where the volumes are throughout the world: even though there's a lot of PR around nonlinear TV, and Netflix fuels a lot of that, most people are actually glued to linear TV. What's interesting now is this notion of connected linear TV that has come in. What is connected linear TV? Think of the 60, 70 years of work on television: it was always designed with broadcast in mind. There was no way of getting feedback. Some broadcaster pushed content up to a satellite, or sent it through terrestrial transmission; it came down the wire, and you had conditional access systems that did decryption and all that, but there was no feedback at all. That has just changed, because the box on which you are watching television is getting connected. It is either becoming a smart box, and you increasingly see this happen in India (in fact, both Videocon and Airtel have just launched their hybrid TV boxes), or you might be controlling television through a smart remote on your phone. You might say: let me decide what to watch on TV, but I'll decide it using my phone. The phone is a connected device. We have a product here: a small infrared blaster that you keep in your home, and once you do that, your mobile phone along with our app becomes a smart remote. You can then talk to it and say "Star Plus", and it changes channels to Star Plus. You can say: take me to a channel that has cricket right now, and it will take you there. So we have a conversational voice interface that we've put into a phone, and we already have a lot of users. It's very faint, but I hope you can see the small bubbles here: these are actual users of our app and product, all over India. So we have a lot of data coming in from everywhere. And what's interesting is that we know what people are watching. So, this question: do people watch TV? There's an interesting graph that we've been looking at.
In fact, I love this graph. Every time somebody asks whether people are still watching TV, I want to show this; this pattern is always there. These are all different days of the week; I just have the last week's data here. What do you notice? Why do they all look the same? TV watching is a habit. People just go back home, switch on the TV, and watch. It always peaks at 8 PM, every day, which in the TV world is called prime time. And this is all data from our product. There's always an afternoon bump too: people have just finished their lunch and want to relax. It could be largely housewives: before lunch they need to cook and all that, and after lunch they have some time to spend before the tea break starts. So there's always an afternoon bump. And notice this: on Sunday, the afternoon bump is actually way higher. On Sunday afternoons, people tend to watch a lot of TV. The other thing to notice is that people put in a solid three hours of TV every day, every day of the week, and this graph just repeats forever, even with our app. So what we notice is that it's a habit: people sit in front of the TV and keep watching. The important thing for you is that the data is coming: you know what people are watching, because they're watching TV through connected interfaces. What can you do with all this data? Here was a system built entirely with broadcast in mind, with no feedback, and now data is coming. Are you ready for all that data? It turns out the TV industry is not ready. They have no clue what's happening on TV, because all of that information is lost. It's a complex ecosystem with six or seven different players: one party broadcasts, a different party produces content, a different party produces metadata. All of this information is lost, and the people who actually care do not have access to the data; if somebody says, I want it, the whole pipeline needs to change. So we looked at this problem and asked: can we use machine learning to do something about it? Our goal is this: can we know what exactly is on TV? I'm not Star Plus, I'm not HBO; but can I know what's running on HBO right now? Not at the level of "there's this show running at this time", but can I know what's on screen? Do I know what ad is running? That's our goal. We want to parse television to the extent that you actually know what data is available. The state of metadata on TV today is this; it's a picture most of you recognize. It's your TV guide, something that you hate. Some of you may have stopped watching TV because of this. This is a contest we used to run: randomly tell a person, okay, go to Star Plus now, and hand them a TV and a remote. They have to go down hierarchically from the top, and you spend five minutes just to get to a channel.
So most people have stopped trying to look for what they want; they just press the up and down arrow keys on the television, and they spend hours together like that. So this is the guide. What does the guide show? It shows what show is running, maybe a little bit of information about it, but nothing more. Can we get a lot more information? Now that TV is connected, we want to prepare the world for getting a lot more information. Can we get the news topics you just watched? Are you interested in the Bihar episode that's happening? Are you interested in Trump? At what point do you switch channels away? You know what's coming on TV at each point in time. The songs you actually wait for: you put on a channel, maybe something like B4U Music, and suddenly a song you like comes up, and you may turn up the volume. For which songs do you do that? You have the data. The actors you did not miss: the television industry itself is trying to be a lot more data-driven, and you'd notice that all TV serials are becoming long series; they run for months and months. Do you know which actor or actress evokes enough emotion that people tend not to switch away from the channel when they are on screen? Then you would probably give them a little more airtime next time, make them a more central character in your script. Can you surface that? The ads you did not switch away from. So we are getting to a world where we're trying to get this kind of data: insight into what exactly is happening on TV. For the rest of the talk I'm going to stick to what we did with ads; we have demos of what we're doing with actors and so on, and I'm happy to show you those through the day. In some sense, if some of you have bought an Amazon Fire TV or been using Prime Video, we're trying to build something like X-Ray, but for linear TV. X-Ray is this small feature where, at any point while you're watching a movie, you can pause it and it brings up these small cards on the screen saying: these are the people on screen right now, and you can then jump off and get more tidbits about those individuals. That was easy, because they were looking at canned content: when somebody makes a DVD, they put in scene-level information, tag it, and ship it to you. But television is live. Some news anchor comes on, he's in the field, a lot of different things are happening. Can you build that level of information for linear TV? So let me quickly give you a demo of our product, ad breaks; I'm going to venture a live demo, because the network has been stable today. This is a product you can also go check out. What we've done is build a comprehensive repository of all the TV ads running in India. It's a system that learns by itself. Nobody tells us: these are the ads, these are house promotions (a house promotion is an ad for a TV show), these are sponsored spots, these are songs. We've just watched TV for a period of time, and we continue to watch it all the time.
It's a real-time system that keeps learning, and it automatically comes up with: this clip is an ad, this clip is a house promotion, this clip is a title sequence, this clip is a song. It figures this out automatically. Ad breaks shows you the ads part of this, plus some interesting statistics. We've been monitoring Indian television, about 250 channels, continuously mining them, and we've discovered that just over the last week there have been 12,000 different unique ads running on Indian television. And how many times did they run? Look at this number: 640,000 times. 12,000 ads were repeated 640,000 times. This is why some of you stopped watching TV: the immense amount of repetition. The reason for that repetition is that there was no feedback. If you build better systems, they can go back and say: this person has seen this ad four times, I'm not going to show it again; then you can build better personalized systems. Now, this is a live system; if someone pulls up a TV somewhere, you can see it's actually changing all the time, live. It says this ad for Hair & Care fruit oils is running, a Wheel soap ad is coming in right now, Pix HD has something, Bajaj Pulsar. Let me see if I can click through while it loads. You can see it's all dynamic, and this is all built up automatically. For every ad, we know how many channels it has been played on, the breakdown of its distribution across languages, and the different creatives and media for it. This is a Pulsar ad that we've discovered in the Telugu language (it might pause, but that's okay), and in different languages: this is a Hindi variant of it. All of them have been put together to make up this page. We also know what comes before it, what comes after it, and so on. And this was all being shown in real time; the live interface shows this in real time. Now, what if I take a full, long TV stream and start putting markers on it, saying: this is what happened at this point in time? I'll give you a visual of that, to show how our X-Ray-style product works. This is a Star Plus timeline. It is a real-time system, but to make it easier I'm going to play back something we stored yesterday: this is yesterday's Star Plus at around 3:30 or so. Standard TV, right? You would have seen stuff like this. But what I want you to notice is that, as the timeline proceeds, we are able to detect and tag what actually came in that stream and mark it out. I'll read it out, because it's a little hard to read here: "dual selfie camera, OPPO F3". So we've discovered this for the OPPO ad on screen. "I am Chandni's wedding planner", right? So this clip turns out to be a house promotion: it's an ad for another show called Iss Pyaar Ko Kya... something, I don't know. So what we got was just this. As the scene changes, we are able to tell: okay, this is Big Bazaar; this is Ambi Pur being tested against a house that smells, a strong smell. In fact, this is an internal interface.
What we're also trying to show is that as the stream flows, we are actually matching against so many different variants, because Ambi Pur will have like 12 different variants for different languages. We're trying to find out which is the best match here, and that's what's showing up: it says this person was speaking in English, so it looks like the English variant has the best match. Even I can't read it from here; it had a 75% match, and the Hindi version had a 56% match. So we are actually trying to see which one is the best fit and then taking a call. Let me move a bit forward, to the end of this. Now let me zoom out, and you'll see the scale of this: this is the whole day. We've mapped all of TV and put tags on it, including who was on screen, and so on and so forth. This is the product that we have, and we also have customers for it. Now, let me walk through how exactly this was done. Let's think about linear TV for a while. What goes into linear TV? What is it made up of? Try to start thinking from a content production angle. There's obviously the content segment; the content is why you're switching on the TV in the first place, and the TV has put all these things together, interleaved, for you. Then you have the title sequence. I'm going to play these clips so that they engrain into you. This is the title sequence; the episode comes every day. In fact, this is the longest-running show on Indian TV, I think it holds all the records, and there is a channel that, out of 24 hours, just plays this show for 18 hours. This title sequence has probably repeated more than anything else. And this is a break marker. A lot of people can recognize the break marker: even if they're in the next room, they go, okay, the break is done, come on, let's go back. Break markers are like sentinels before and after the ad break. Nobody told us this is a break marker; in fact, these are all real clips that our machine generated, which I've put over here. This is a house promotion. All four of these clips were actually about Taarak Mehta: one of them is advertising tomorrow's show, one of them is a small break marker, one of them is a title sequence, and the other one... right? So our job is to reverse engineer these things out of television by just watching TV. It's not humans watching; our machines are watching, and they're trying to pull these out. In other words, can you take the full video stream of television and come back with these markers? There are tons of applications for this. If you want to convert linear TV into catch-up TV, you segment out the content parts and the ad parts, stream the content online, and then you can choose to put different types of ads in there. So it's important to know what TV is made up of, because the broadcaster doesn't even have the technology to do all these things himself. Now let's think a little more about what it means to build metadata for ad breaks in particular.
Although we are doing a lot more, for the context of this talk I want to focus on ad breaks. TV is continuous streaming video; nobody tells us what the ads are, and every day there are tons of new ads coming up. It's 60 frames per second of features, with aligned audio, and there is some EPG metadata, so you know a given show is supposed to start at this time and end at this time. That is the context. There are no standards or compliance: there are 800 different channels, with new channels coming up all the time, and that's just India; worldwide, across languages, they tend to use different kinds of systems. So it's not an IT problem; there is too much diversity. And no watermarks; if you don't know about watermarks, don't bother, because they don't help. An ad break looks like this regular expression: a content segment, then an optional break marker, followed by one or more ads or house promotions, then another optional break marker, and then the content segment continues. That is the ad break, and we want to pull all these clips out and start tagging them (there is a small sketch of this grammar at the end of this section). So let's think: are there statistical properties of each of these that we can exploit to build a system that works? One thing to note: in my experience, I tend to call myself an information retrieval guy, because you care about the semantics of the information and look at whether you can exploit them in a given domain to make your job easier. For me, machine learning is a tool; supervised or unsupervised, it doesn't matter, whatever works. But having domain knowledge helps you solve a problem faster. So you quickly want to understand the domain and see what works here. Break markers are signature sequences of video or audio that are specific to a channel and/or a show. The interesting property of break markers is that the break marker shown when Taarak Mehta is on is different from the one when Ishqbaaz is on, or when That '70s Show is on, or when Silicon Valley is on. The break markers are always different, but they always repeat whenever their show is on. So if you analyze the regions of video that always come around when a given show is telecast, you have a better chance of finding the break markers; they are very seldom seen elsewhere. That gives you some clues. They are the sentinels that separate the content from the ads, at the start of an ad break and at the end of it. What about ads? Ads are short, very short; it costs money to put them on air. Ads are 10-seconders, 15-seconders, 20, and so on up to 30 seconds; very few ads are a minute long. They are heavy on audio: the thing with advertising is that even if you're not watching the TV, you should be able to hear the brand; that's how ads are designed. They repeat a lot; that's what you're all annoyed by. And they occur across channels; they're not loyal to one show or one channel, because the advertiser wants to maximize impressions. And they always come together; they've been bundled together.
So there's a lot of locality to ads: if you think there's an ad somewhere, there's a very high likelihood there's another one right before or after it. You can use these properties. What about house promotions? They feel very similar to ads — they always come in ad breaks, and they're basically ads for other shows — but they're different in one respect: they change every day. An ad typically stays the same for an entire campaign, maybe about a month, or if the advertiser has a lot of money they might refresh it every week. House promotions change daily, because the promotion for tomorrow's show has to be created today, and the day after's will be beamed tomorrow. That they feel so similar to ads also makes them a little hard to distinguish. What's interesting is that house promotions always feature actors who otherwise appear in content segments, which is not so with ads. So if you have a mechanism to recognize faces that come up a lot in content segments, you know those faces will most likely occur in house promotions too. In content segments, the lead actors occur a lot — that's why they're the stars of the show — and the show metadata you get from the EPG will most likely give you their names. If they're reasonably popular, they have some web presence, which means photographs are available; if we index those photos, we have a chance of identifying when these people come up on TV.

So again, the broad theme is that TV has a lot of repetition. Can we generate a dictionary of features? The central idea of how we operate is: watch TV over a period of time, and because TV repeats so much, gather features that are very characteristic of television, build a large repository of them, and then look for them again in live television. Ads, house promos, and break markers are all repeating clips; people repeat a lot too, so we can also collect repeating facial features. The whole idea is: can we look for learned features in real time and use them to tag streams on the fly? I'll take you through the algorithmic sequence. The starting idea is to look for clips that tend to repeat often on television. But where do you look for these clips? What's the infrastructure? Are you going to attach a computer to your cable connection? You need somewhere to actually do all this mining. So here's how TV actually reaches your home, because that sets the context for where the mining happens. A TV channel — say HBO — uplinks its signal to a satellite; from the satellite there's a downlink to the multi-system operators, who are your TV operators: Airtel, the cable operators, the Tata Sky DTH operators. They have dish farms where all the incoming TV lands, and after that they encrypt it all and push it through the wire to your homes.
What we've done is a deal with these operators: we put our data centers on their premises, which means we get untethered access to all the TV channels across India in one place, and we do all our analysis there. The other thing to note is that video processing is very, very expensive. At 60 frames per second, if you try to analyze every frame, you'll burn all your cash just on hardware — and Abhay was in fact talking about wanting GPUs, which makes it even more expensive. So you need ways to make this efficient. What we do is say: ultimately we want video, but let's start with audio, because audio is cheaper. So we strip out the video and start with the audio. But raw audio — waveforms, frequencies, amplitudes — is still not a convenient form for analysis. So we convert the audio into a number sequence. There's a known method called acoustic fingerprinting — if you've heard of Shazam, the sound-recognition system where a song is playing, you open your phone, and it recognizes the song, this is a similar method. You take a piece of audio and derive a number sequence from it. At this stage we've converted a TV stream into just a sequence of numbers, and our job is easier: we need to look for repeating sequences of numbers in this long, long stream, which happens to be a very famous problem called frequent episode mining in data-mining terminology. And magically, all these audio clip candidates keep bubbling up. They're not perfect — you need a lot of post-processing — but at the core we've turned it into a numbers problem: look for maximal repeating sequences of numbers, and those are the interesting clip candidates to work on.

Then, given an interesting audio candidate, we go back to video. We started with an enormous space of video to analyze and have just stripped most of it away, because we only care about the interesting audio candidates — the search space is drastically reduced. If the audio ran from, say, T seconds to T + 10 seconds, we take T − 4 to T + 10 + 4, gather all the frames in that window, and run one more process of maximal repeated-sequence mining, this time over key frames. Here's an example: something else could have come before this clip, and something else after; you collect enough samples and see which span repeats consistently. Do that, and you end up with a clip. This example was clipped out of Sony SAB, the channel — it's not some clip we got from YouTube — and we get it right from the first frame to the last; the next frame after this would have been something else. It's just beautiful, and we do it at scale: every week, some 12,000 such clips bubble up. There are some statistics here I'll skip over.
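A highly simplified sketch of that core loop: treat the fingerprinted stream as a list of integers and look for n-grams that recur. Real acoustic fingerprinting and frequent episode mining are far more involved (maximality, noise tolerance, and post-processing are all omitted here); the function and data are mine, for illustration only.

```python
from collections import defaultdict

def repeating_ngrams(fingerprints, n=4, min_count=2):
    """Return n-grams of fingerprint values that occur at least
    min_count times -- crude stand-ins for the repeating clips
    (ads, house promos, break markers) described above."""
    positions = defaultdict(list)
    for i in range(len(fingerprints) - n + 1):
        positions[tuple(fingerprints[i:i + n])].append(i)
    return {g: p for g, p in positions.items() if len(p) >= min_count}

# Pretend this is hours of TV reduced to one number per audio window;
# the block 3,1,4,1 recurs the way a short ad would.
stream = [3, 1, 4, 1, 7, 9, 2, 6, 3, 1, 4, 1, 5, 8, 3, 1, 4, 1]
for gram, pos in repeating_ngrams(stream).items():
    print(gram, "at", pos)  # (3, 1, 4, 1) at [0, 8, 14]
```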
This is the output of the system. It's a real-time system that's been deployed; people are using it to make money as well. Why do all this? What can you get out of it? I'll read out some insights — it might be hard to see here — that we can derive from the data sets we have. For instance: what ads are watched on TV by people with the Twitter app on their phone? How many of you have the Twitter app? Looks like everybody. You all watch TV too. Now, can Twitter know what you watch on TV? You're all loyal Twitter users — does Twitter know what ads you see on TV? These are two different islands of information, the web on one side and TV on the other, and we're able to bridge them. A lot of Twitter users have seen the Bharti AXA insurance ad, Skoda Rapid, Jeevansathi.com, Datsun redi-GO. This is a huge monetization opportunity for Twitter itself: suppose you have the Twitter app, you're in front of the TV, and you see the Lakmé Eyeconic Kajal ad come up. Then you start browsing Twitter — obviously, it's an ad break — and you see the same ad show up there. The next time that happens, you'll know who's behind it. This is called TV-synchronized advertising, and there's huge demand for it, because you have a world of brands spending so much on TV and increasingly on digital media, but the two aren't even connected.

Another interesting one: top ads seen on TV by Redmi Note 4 owners versus Mi 5 owners — both Xiaomi phones, so you'd say they're just two Xiaomi phones. People with a Redmi Note 4 have seen the Shaadi.com ad, Jeevansathi.com, and Titan Skinn perfumes. People with the Mi 5? Berger anti-dust paint, Honda City, Tourism Australia. This is real data, and it has to stand for something — it's not there by chance. It says a lot about preferences: when somebody decides, I'm not going to buy a Redmi Note 4, I'm going to buy a Mi 5, there's something in his mind that psychographically associates with, maybe, being the guy who responds to a Honda City ad. All these interesting patterns come up. Now think about what this data means for Xiaomi.

Another one: Airtel Payments has been doing a lot of advertising. What other apps are on the phones of people who have seen the Airtel Payments ad? Everybody has YouTube, because it's all Android, so that's a given; but we noticed the top app by likelihood is a UPI app, and they also have Paytm. Doesn't that suggest something is going right with their targeting? They're reaching people who seem to be their demographic. If that weren't the case and they were reaching someone else entirely, they'd know to change their strategy. How many minutes do I have? Okay. So I hope you're convinced there was a huge need to build this system, and that's without even looking at all the use cases that can be made out of it — multiple, multiple different use cases.
We're seeing interest from OTT operators, advertisers, all kinds of people. Now, the question is: this guy has come here and says he's built the system — how good is it? Can you even measure that? Does it get all the ads? Is there a ground truth? What percentage of ads are you able to uncover, from a machine-learning perspective? That's what we wanted to establish. There is no real ground truth, because ads change every week. Well, there is almost one: a semi-government agency called BARC, the Broadcast Audience Research Council. What they've done is gone to all the broadcasters and said, put in watermarks; and they've installed people meters in people's homes — a very big operation — and they in fact pay people to watch TV. Someone watches TV the whole day, one channel per person, and whenever an ad comes up they make an entry into a spreadsheet: this ad, at this time. Which is why that data is very old — you don't have a real-time grasp from it — and it's very sparse. So we wanted our own way of telling how good the system is.

You may know the terminology of precision, recall, and the F-measure; it's important for data scientists to have some discipline about this. For us, precision means: was everything we called an ad indeed an ad? You don't want to arbitrarily call something an ad. Every time you say it's an ad, it had better be an ad — that's precision. If we claim to be a 100%-precision system, then every time we make the call, it should really have been an ad; if it wasn't, we lost precision. Recall is: did you catch all the ads, or were there holes in between — did you miss some? A false positive is something that wasn't an ad but got tagged as one; a false negative is an ad that we missed. Because this is supposed to be an industry-ready system, we said, let's focus on 100% precision. This is also something I learned while working at Google: always focus on precision over recall. People will say you should aim for the F-score, the harmonic mean of the two, but industrial systems always go for precision, because humans do not like errors. We see this a lot — we have an app out there, a product out there; one bug and they'll say it's the worst thing that ever happened to humanity. People are very unforgiving of errors, so you need to build systems that are 100% precise. So we put in a human layer: every ad that goes out of our system, even though our machine says it's an ad, gets validated by a human. There's a question of how efficient that human can be, but we can park that for now. We built a tagging utility where they can play the clip, check whether it's indeed an ad, and quickly mark it. It takes about one to two minutes to verify one ad, so it's quite scalable: we need just three or four people to keep up.
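For reference, the standard definitions just mentioned, in a minimal sketch — the counts below are made-up illustrations, not the speaker's numbers:

```python
def precision_recall_f1(tp, fp, fn):
    """precision: of everything we called an ad, how much really was one;
    recall: of all real ads, how many we caught;
    f1: harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 930 ads correctly tagged, 0 false alarms (the 100%-precision
# goal the human layer enforces), 63 ads missed:
print(precision_recall_f1(tp=930, fp=0, fn=63))  # (1.0, 0.936..., 0.967...)
```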
So how do we measure recall? One insight we had is that ads always come together — ad breaks arrive in bulk. Think about how shows are built: how annoying would it be if the show started, then a two-minute ad break came up, then another two minutes of show, and so on? Instead, you get a stretch of three or four minutes of content and then an ad break. Because breaks come in bulk, it's easy to spot a miss on the whole timeline: if you missed an ad, you start seeing holes. Look at this: a long content stretch, and suddenly our system says "ad" for three or four seconds — that's a false positive. Similarly, a long stretch that is otherwise an ad break, with a hole somewhere in the middle — that's a false negative. Statistically, we turned this into a formula: our recall is the duration of the ads we matched, divided by the duration of the ads we matched plus the duration of all the holes we left out. It's a very good measure of recall, and we track it continuously for every channel — these are actual numbers: for Sony it's 93.7% recall, 86% for another channel. We can never achieve 100% recall, because a new ad has to be learned: it takes two or three repetitions before it bubbles up and our system locks onto it. But these are actually pretty good numbers in that sense. Okay, we're out of time — there are ways to improve this, and there's a lot more to be done.

Thanks a lot, Bharath, that was a very interesting presentation. We won't take Q&A right now because we're out of time, but there's a joint Q&A session with Bharath and our next speaker, Paul, after a short beverage break. We'll break for tea now. There are feedback forms on your seats — kindly fill them in; it helps us make our conferences better. You can pay for food using Paytm or tokens, which you can purchase at the HasGeek counters. There are OTRs happening today in room 1 as well as here, and the OTR on data visualization will start at 12 o'clock in room 1, so do go if you're interested. Also submit your proposals for flash talks — we're having flash talks today starting at 5:40 p.m. A flash talk could be anything: a five-minute presentation about an open-source project, an interesting idea you're working on, or really anything you want to share with us. So do submit your proposals. Thank you.

Check, check. Hello everyone, good to have you back — hope you enjoyed your tea. We're starting our next session with Paul, who is going to talk about designing ML pipelines for mining transactional SMS messages. And I cannot stress this enough: please switch your phones off or put them on silent; it's not good when a phone rings in the middle of a talk. Thank you — over to Paul.

Hi everyone, good morning. SMS is a pretty rich source of personalized data in India, applicable to a lot of use cases, especially in fintech and personal finance. If you're a data scientist or engineer interested in this kind of data, then hopefully this talk is interesting.
But also, if you're a data scientist or engineer interested in complex problems where you're layering models on top of each other — in a more generalized way, it might not be with SMSes — then hopefully this is interesting to you too. I want two broad takeaways to come out through the course of the talk. The first is an illustration of the simple but important concept of breaking something into pieces and solving the pieces. I like the way Max Tegmark puts it: if you have a tough question you can't answer, first tackle a simpler question you can't answer. It's one way of expressing the concept. The second takeaway has to do with the architecture and design of these systems. When you're applying machine learning to a problem — at least in my experience — you almost never fully understand what you're building when you start, which means you'll have to do things 25, 50, 75 percent into the project that you didn't realize you'd have to do at the beginning. Unfortunately, you also face path dependency: early choices affect late choices. So a key quality to think about when you're designing the system in the early stages is extensibility. In my experience, a lot of engineers and startups make a big issue of scalability while forgetting about extensibility, and I think that's a shame, because we can only hope to face a scalability problem — massive numbers — but we're almost surely going to face the challenge of extensibility: how do we extend our system to do things we didn't know we'd have to do?

A bit about me — ah, the clicker doesn't seem to be connected... there we go, appreciate it. A little about me, to give you some context for how I think about this talk. My professional experience has oscillated between two themes. One is statistical inference: how do you ask questions about causality from data in order to inform decision-making? The other is the engineering and the software, because you can develop really complex and interesting models for causal inference, but until they go into production — until you deal with all the dirty problems of getting data from one place to another, or making a decision, recording what happened, and asking whether it was really the right one — it stays pretty academic. So I went from my graduate work, at university and in government, where I thought more about statistical inference, into the private sector, where it became more about engineering the system so it actually works. Most recently I co-founded a startup called PaySense, a mobile lending startup here in India, based in Mumbai — and that's where this problem set came from; this talk represents some of the work we did there and work I've continued since.
Now I'm at a venture capital firm called Montane Ventures, where I'm a data scientist in residence — and a quick plug: one of the things we care about when investing is really understanding the product and technology being built, which a lot of entrepreneurs care about and which you miss out on if you're not from that space. We like thinking about the problems engineers and entrepreneurs think about — not just "is it a good business", but are you building technology that's interesting and widely applicable. If you're interested in that kind of thing, by all means come talk to me.

Okay, this is what we're here to talk about today. If you have a phone in India, you know what these look like; you have them on your phone. Interestingly, what differentiates SMS in India — a worldwide technology — from Singapore, the US, and other places I've lived and worked is the requirement of two-factor authentication: banks and other financial transactions that happen online have to involve approval through the phone, so you have things like OTPs. A side product is that, because the phone is such an important part of everyday life anyway, a lot of the services we interact with as consumers have some element of the phone in them — you get your notifications from Uber and from Amazon as well as from your bank. So it becomes, as I said at the beginning, a very rich source of data. The technical problem, ultimately, is this: for a use case like giving a loan, what you care about is how much this person spends and what their typical bank balance looks like. But what you have as raw data is an SMS that looks like this. The machine doesn't know that you spent 4,000 on the credit card; it just sees raw text. So the technical problem we'll talk about today — the statistical, machine-learning problem — is moving from this to this in an automated way, where we can say every message is composed of a template structure plus variable information. It's not what people usually mean by NLP: the problem is in many ways not as hard as something like Twitter, where there's no structure and anyone can say whatever they want. There's latent structure here; it's just the variable amounts that change — you're all receiving the same message. But it's not quite that simple either, because there's a lot of variation in the templates and other things. That's the general problem. And why does it matter, if you care about use cases and not just the technical issues? This is a fintech market map of India; I've looked at most of these companies, and most of them are at least asking permission — which you may not read when you click OK and install the app — to access your SMSes, and trying to use them in different ways. So this is not a cutting-edge problem, but I don't think the solutions to it have been done as well as they could be.
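A toy illustration of the template-plus-variables framing, with a made-up message and one hand-written pattern. The named groups are the "variable information"; everything outside them is the fixed template. (In practice nobody hand-writes these patterns per template — that's exactly what doesn't scale, as discussed below.)

```python
import re

sms = ("INR 4,000.00 spent on your HDFC Bank card ending 1234 "
       "at AMAZON on 05-03-18.")

# Everything outside the groups is fixed boilerplate (the template);
# the groups are the variable slots.
template = re.compile(
    r"INR (?P<amount>[\d,]+\.\d{2}) spent on your (?P<bank>.+?) card "
    r"ending (?P<acct>\d{4}) at (?P<merchant>\S+) on (?P<date>[\d-]+)\."
)
print(template.match(sms).groupdict())
# {'amount': '4,000.00', 'bank': 'HDFC Bank', 'acct': '1234',
#  'merchant': 'AMAZON', 'date': '05-03-18'}
```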
I often hear people say, for instance, that this is a commodified problem. I think that's a bit like saying in 2000 that search was commodified: there had definitely been search systems around, but the extent of their development and evolution wasn't there yet. In personal finance we have companies like Walnut, and they'll often get things wrong: say you have two bank accounts and you make a transfer, so you receive a debit or withdrawal notice on one account and a credit on the other. Knowing that money is moving within your own system — that it's not a net outflow plus an inflow — is something they often get wrong.

Before we jump into the system, I also want to point out — because I've talked about this problem before and people come up and ask how they can access this kind of data — that while you won't be dealing with the problems of scale, this is not a problem you can't grab hold of and try your hand at just because you don't have an app out there collecting other people's data. Let me walk through a couple of ways to get started. Say you have an iPhone, which is what I have: take an image backup of the phone, go digging, and you'll find a very obscurely labeled file that turns out to be a SQLite database — that's where your messages are stored. If you have an Android, it's a fairly similar process; I haven't tried it many times because I don't have an Android phone, but it's worked. That gives you your messages, and maybe the messages of family or friends willing to help seed your data set. There are other ways to introduce even more variety into your sample: there are a lot of bulk-SMS service providers out there who publish templates and message structures for businesses wanting to send bulk SMSes. Google them, pull them down, and suddenly you've got a whole new set of templates. And don't stop there, because that still won't give you enough variety — the next step is simulation. As a side note, I think this is useful even if you already have the data, because creating this kind of data forces you to really understand the data-generation process: where does it actually come from? What we're doing here is a basic Python script using regex and some strings. We have a template across the top — this one happens to be a credit message — and the way it's phrased in English is pretty arbitrary; you could compose it any number of ways, and banks do. So we create some artificial variation that's syntactically different but not meaningfully different, and a script that cycles through messages, randomizing and varying them. From one template you get many. So instead of saying "has been deposited to", it can say "credited to" or "deposited into" or "deposited in" or "credited in".
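A minimal sketch of the kind of simulation script described — the template, phrase variants, and slot ranges are all made up for illustration:

```python
import random

# Syntactically different, semantically identical phrasings.
PHRASES = ["has been deposited to", "has been credited to",
           "has been deposited into", "has been credited in"]

TEMPLATE = ("Rs.{amount} {phrase} your a/c {acct} on {date}. "
            "Avl bal: Rs.{balance}")

def simulate(n=5):
    """Generate n artificial credit SMSes with randomized slots."""
    return [TEMPLATE.format(
        amount=round(random.uniform(100, 50000), 2),
        phrase=random.choice(PHRASES),
        acct="XX" + str(random.randint(1000, 9999)),
        date="{:02d}-{:02d}-2018".format(random.randint(1, 28),
                                         random.randint(1, 12)),
        balance=round(random.uniform(1000, 100000), 2))
        for _ in range(n)]

for msg in simulate(3):
    print(msg)
```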
As humans we don't care about that variation, but it's exactly the variation our machines have to see through when they're learning. And we do need the variation, because the variation is the problem: if there were only five templates per bank, or even thirty-five covering all the message types, you could manually look at them, write some regex, and brute-force it. That's not the situation we're in. In India, with some basic research, you'll see we have hundreds of banks of different types, plus all the service providers that send SMSes about your transactions and whatever else is applicable to your use case as a business or an application. They're all creating their own templates, message structures, and entity types — that's where the variation comes from, and that's why brute-force manual labeling of these messages won't work.

So with that context set, we can think about how to design toward the information we ultimately care about. From here on, let's set aside the other types of SMS and focus on banks — that's the really interesting financial data. Actually, let me pause before the statistical material and note another concern this should make you aware of: there's a lot of personally identifiable information here. SMSes can carry all kinds of private, sensitive information, from your financial transactions to Shaadi and other very personal services. We had a panel yesterday about ethics and accountability, and for me it's very important that when you design these systems, you design them to minimize the information you don't want to capture. A baseline to start from: every sender of bulk SMS is required to register with the government and gets a six-character sender ID, which is different from the phone numbers used when you're texting friends and family. So when you set up your SDK to capture this data, you exclude the personal SMSes and only take messages coming from those six-character sender IDs — the transactional SMSes.

So we have those SMSes, and, taking the lending case, we're really interested in these particular pieces of information: we were debited 300 rupees, and our balance is 4,946. That's the end goal — but it's only the highest level. A debit is a basic transaction, a credit is a basic transaction, but as you go in you realize there's a lot of specificity in your ontology, and you have to think hard about your domain. If you're lending money you care, for instance, about whether a person makes their payments on time, so there might be a due notice — and you don't want to classify a due notice as a debit.
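A sketch of that baseline sender filter. The exact surface form of commercial sender IDs varies by operator and route (a common shape is a two-letter prefix, a hyphen, and the registered six-character ID, e.g. "VM-HDFCBK"), so treat the pattern as an assumption to adapt, not a spec:

```python
import re

# Registered bulk senders show up as short alphanumeric headers
# rather than phone numbers; personal SMSes come from numbers.
SENDER_ID = re.compile(r"^([A-Z]{2}-)?[A-Z0-9]{6}$")

def is_transactional(sender: str) -> bool:
    """True for bulk/transactional sender IDs, False for personal numbers."""
    return bool(SENDER_ID.match(sender.upper()))

print(is_transactional("VM-HDFCBK"))      # True  -> capture
print(is_transactional("+919812345678"))  # False -> exclude (personal)
```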
It hasn't happened yet. Then you have your late fees, your bounced checks, your balance checks — they all mean something different, and you can't just capture a list of amounts. You need all the structure and context around an amount for it to be meaningful, and that's what we want. So how do we get from raw text down to that level? What's our handle? Do we just build one model that classifies this message as containing a debit and says the 300 is the debit? This is where the design comes in: you think through what you actually want out of it, what layers there will be, and how to structure the models.

The first step: we said we're interested in banks, but the machine doesn't know which messages are from banks. So, as a start, let's classify messages that come from banks. You'd think that's straightforward, because there's latent structure in bank messages and a commonality of topics: they tend to have financial terminology, account strings, and usually a date string, because a transaction has a timestamp. Those things help identify a bank, and you could classify message by message. But if we do that, we're leaving a lot of information on the table. We don't actually need to look at a single message and decide whether it's from a bank — the sender is the bank, and the sender sends a lot of other messages. If we jump in and classify at the level of the message, we're ignoring all the other messages that sender sends. So instead, let's aggregate all the messages from a particular sender and classify the sender. Now we know that HDFC BK means bank.

Okay, so now we have our bank messages. Before we can really get to the 300 rupees, we need to know its context. At a high level we have debit messages, credit messages, due and overdue notices; and again, we have a more limited variety of text, because we know these are bank messages and they tend to be one of these types, so we hope there's structure we can use to classify this message as a debit and that one as a credit. That's really important, because if you get it wrong, you have money coming in when it should be going out, or vice versa, and that will mess up your risk models. And then, say we're interested in a person's bank balance over time. We could model where the bank balance sits in a message, but it doesn't always show up. Here's Citibank, and here are two types of messages from Citibank that are both debit messages: one contains balance information and one doesn't. We wouldn't want to find ourselves looking for the bank balance in a message that doesn't have one — is that number the balance, or the debit amount? So we also need to classify the entity types we're looking for. Let's think about how we'd do that.
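A minimal sketch of the sender-level classification step described above. The training data is invented and tiny; multinomial naive Bayes is the model Paul mentions later in the Q&A as working well, and pooling one "document" per sender is the aggregation idea:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# One document per sender: all of that sender's messages concatenated,
# labelled with the sender's category. Real data is far larger.
docs = [
    "a/c XX1234 debited Rs.500 avl bal Rs.9000 a/c XX1234 credited Rs.200",
    "OTP for your txn is 443312 do not share this OTP with anyone",
    "your driver arrives in 3 mins ride receipt total fare Rs.120",
]
labels = ["bank", "bank", "cab"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(docs, labels)

# A new, unknown sender: pool its messages and classify once.
pooled = "a/c XX9921 debited Rs.1500 on 03-04 avl bal Rs.42000"
print(clf.predict([pooled])[0])  # most likely 'bank'
```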
Now we're not just classifying the full message. For a bank it was a full set of messages; for a debit message you're classifying the whole thing and don't need to know anything about position within the message — you just learn that this is a debit message. But how do we know where the entity is? One way: we know a balance will be a currency amount, so we can write rules that isolate numbers containing at most a decimal point and not attached to anything alphanumeric or to dashes — that we can do with straightforward parsing. Once we've done that, we still need to know which of these numbers represents which entity. One way is a sub-segment approach: for each number at some position n, take a window of, say, three or four tokens on either side — n minus three to n plus three — so you have the number, and you classify the sub-segment. If you look at these two sub-segments, you can see that we'll be able to learn that the 12,040 is the debit and the 30,000 is the balance. Each of these is a model, and we're layering them on: we don't try to solve the final objective until we've solved the earlier, easier problems.

And so we have our pipeline: we receive SMSes, send them into each classifier, figure out when to store, when to send on, and when to classify, and ultimately the structured financial data lands in our database. At one step the system looks like this — I'll read it out, since the resolution probably isn't good enough for you; I have more granular versions coming. What I want to point out is that in this system, even once you've broken the problem down, it's also important to understand the human-machine tradeoff. Humans will always be able to do highly sensitive analysis — we recognize variance and difference very easily; machines don't do it as well — so figure out how to combine the two. We receive a message; say we're classifying at the bank level; we compare the sender to a list of known senders and determine its identity. If it's known, we assign the category. If it's unknown, we send it to the model: we run some transformations on the data — anonymize strings containing numeric values, hash some n-grams; I'll say more about both in a bit — and predict the probability of each sender category. Then we determine whether the prediction is clear or ambiguous. If it's clear, we go ahead and assign it; if it's ambiguous, we send it to the user. What I mean by "user" here: if you're building this system as a lender, you really need to understand the data you're underwriting on top of, so it matters in this case that you're getting it right.
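A sketch of the sub-segment idea just described: isolate currency-like numbers (no attached letters or dashes), and emit the ±3-token window around each as the unit to classify. The helper and the regex are mine; the 12,040/30,000 example mirrors the one in the talk:

```python
import re

def number_windows(message, k=3):
    """Yield (number, window) pairs: each currency-like token plus the
    k tokens on either side -- the sub-segment a classifier would
    label as debit amount, balance, etc."""
    tokens = message.split()
    # digits with optional commas/decimals, optional Rs./INR prefix;
    # excludes tokens with attached letters (XX1234) or dashes (dates)
    currency = re.compile(r"^(rs\.?|inr)?[\d,]+(\.\d+)?$", re.I)
    for i, tok in enumerate(tokens):
        if currency.match(tok):
            yield tok, " ".join(tokens[max(0, i - k): i + k + 1])

msg = ("Your a/c XX1234 is debited for Rs.12,040.00 on 03-04-18. "
       "Avl bal: 30,000.50")
for number, window in number_windows(msg):
    print(number, "->", window)
# Rs.12,040.00 -> is debited for Rs.12,040.00 on 03-04-18. Avl
# 30,000.50 -> 03-04-18. Avl bal: 30,000.50
```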
So what we did was build out a visual interface for the 5 or 10 percent where we're unclear. It's very easy, for instance, to tell the difference between a bank message and an Uber- or Ola-type message, but there are a lot of NBFCs out there that look very similar to banks, and that's where we get our false positives. So we built a basic interface that lets our customer service team and other teams sit there as messages come through: when we weren't getting a classification right, or weren't confident about it, it would go to them, they'd say "we missed this one, it's wrong", and relabel it. That was very important for us — it's where a lot of the change over time happened, where we realized we had to model things we didn't think we'd have to model. Our risk team would say: wait, this is a late notice, but it's the third late notice; you can't classify it the same way — it's a late notice plus the amount has increased, so you need to classify it separately. And that matters, because some banks send late notices over and over and over — they're just excited to get their money back — while some banks send them only occasionally. That difference shouldn't be read as the person's propensity to repay, so you don't want that noise confusing your signal.

Then there are all the details of the system as you build it out. At the earlier stages, a lot of the time what we're classifying is the template, not the entity — remember, you have the template and you have the variable, and in that early version you're really looking to exclude the variable. We're basically vectorizing text, and if the variable stays in your data, you get a much wider data set: some messages are just a debit amount that can be any number, and the difference between 300 and 100 rupees is meaningless for your model, so you don't want it there. There are different ways to do this, depending on your throughput and where you want to do your computation. One way, for instance: if your messages are being stored in Postgres or some relational database, then when you've received enough messages from a new sender to classify it, you can clean as you query, before running it through the model. That works okay if you're just stripping numbers out. You might then get more elaborate, and do the transformations in a script so you can be more flexible in how you clean the text before sending it to the model. For instance, we might want to clean dates, so we write some regex and think through the logic for identifying dates — because dates, like amounts, shouldn't matter: whether it's March or April makes no difference for identifying a debit message.
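A small sketch of that kind of normalization pass — masking dates, amounts, and stray numbers with placeholder tokens before vectorizing, so syntactic variation in the variable slots collapses. The patterns are illustrative, nowhere near exhaustive:

```python
import re

DATE = re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b")
AMOUNT = re.compile(r"(?:rs\.?|inr)\s*[\d,]+(?:\.\d+)?\b", re.I)
NUMBER = re.compile(r"\b\d+\b")

def normalize(msg):
    """Replace meaningless-variation tokens with placeholders so that
    'Rs.300 ... on 02-03-18' and 'Rs.100 ... on 11-04-18' look alike."""
    msg = DATE.sub("_date_", msg)
    msg = AMOUNT.sub("_amount_", msg)
    msg = NUMBER.sub("_num_", msg)
    return msg

print(normalize("Rs.300 debited from a/c XX1234 on 02-03-18"))
# _amount_ debited from a/c XX1234 on _date_
```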
And then we have other problems, like size. If you're lending, people are receiving 20, 30, 40, 100 messages a day or every couple of days, and with tens of thousands of users you quickly accumulate a lot of weight. There are different ways to handle that — actually, before I talk about size, let me talk about pipelines. We've been talking about pipelines in an abstract sense through this talk, but there are also pipelines in the particular sense: scikit-learn has something called Pipelines, which let you package all of the transformations and cleaning you do before sending data through the model in a very clean, contained way. Basically, you define your vectorizer and give it some specifications; maybe you have some missing-data imputation, which uses some other data to fill in a value, or other transformations; you pass all of that to a Pipeline, and subsequently you just call that classifier, with everything nicely packaged. That ends up saving a lot of effort when different text-transformation methods are applied at different stages of the model: everything is cleanly in one place. The engineer or data scientist who looks back at it later may not have written that code — and you certainly won't understand your own code from a couple of months ago — so keeping it clean like this is a really good way to do things. And a talk plug: there was a talk at PyData Chicago last year about this, and I think it's a very good concept, especially when you're layering a lot of models on top of each other.

Now, size. Say we have a couple of million messages — a little more than we want to deal with while testing the model. You can read the data in chunks, wrap it in an iterator, and use scikit-learn's HashingVectorizer with a partial fit: you cycle through the iterator, apply your transformations, and apply your classifier with partial_fit, so it just updates the model on each segment of the data without needing the entire vocabulary. You have to make choices when you do this. The hashing vectorizer is great: it's low-memory, it scales to large data sets, it doesn't store the vocabulary dictionary in memory, it's fast, and so on. But where before I was doing TF-IDF — a particular kind of vectorizing — you can't do that with this, because it doesn't store the vocabulary. And often that means that if you want to understand what the model is doing — which features mattered — you can't, with the hashing vectorizer. Fortunately, in this case, we don't care: we don't need to account for why a message was classified a particular way. As long as the model is accurate, we don't need to know which features made this a due notice and that a late notice.
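A minimal out-of-core sketch of that loop, under a few assumptions: the batch generator is a stand-in for reading chunks from a database, and alternate_sign=False keeps the hashed counts non-negative so multinomial naive Bayes can consume them. All data is invented:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no vocabulary held in memory, so it streams;
# the tradeoff is you can't map features back to n-grams afterwards.
vec = HashingVectorizer(ngram_range=(1, 2), alternate_sign=False)
clf = MultinomialNB()
CLASSES = ["debit", "credit", "due"]

def batches():
    """Stand-in for iterating over a few million messages in chunks,
    e.g. from a Postgres cursor, after normalization."""
    yield (["a/c debited rs _amount_ on _date_",
            "rs _amount_ credited to your a/c"], ["debit", "credit"])
    yield (["payment of rs _amount_ is due by _date_"], ["due"])

for texts, labels in batches():
    clf.partial_fit(vec.transform(texts), labels, classes=CLASSES)

print(clf.predict(vec.transform(["rs _amount_ debited on _date_"])))
```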
So there are a lot of these micro-decisions, and over time you get quite a large, complex system. If you've taken care at the level of each piece, then when you go back to change something, it's much easier to go in, update one piece, expand the information it can capture, without taking down the entire system. So here we receive a message, we classify the sender, we send it to the model, we repeat for message types, we send ambiguous cases to the user so they can clarify them — and we do that at each level of the model. That's the kind of system we get. I wanted to end a little early — I think I've succeeded — so we have room for questions.

I'd also invite Bharath on stage; Bharath and Paul will have a joint Q&A, so you can ask either speaker questions.

Question to Paul: you mentioned that we can find the probability of the sender's category when the sender is unknown — how is that probability calculated? — Well, that's what the model is doing, and you can use a lot of different models; a multinomial naive Bayes basically ends up working well. We take the SMS, break it into pieces, apply a classification model, and since we don't know who the sender is, we classify it as, say, a bank. Does that make sense?

Hi, a question on transfers: when customers transfer money from one account to another, most solutions can't handle that. How would you approach the problem? — So then you're getting down to the level of the actual individual. We're also classifying — which I didn't talk about here — what kind of account it is, a savings account, and the account number. Different messages anonymize differently; nobody sends your full account number, so you might have the last four digits, or the first two and last two. You learn what those structures are, build a library of account numbers for each user, and assign a probability that a particular transaction is associated with this account or that one. Then you can integrate information like how the balance moves between two messages. So we're modeling things at the level of the actual user: you can classify a message across all users, but here, for a given user, you call up their accounts, their previous balance and subsequent balance, and see whether the amounts add up. That's how you get at that problem. None of these problems in and of itself is that hard; doing all of them in one pipeline is where it gets complex.

Hello — let's say one day a bank changes their format entirely, the dates and balances and all of it.
Will this solution handle that? — It's a bit hard to hear you — okay, so let's say a bank changes its format entirely one day; will the solution apply? You're saying the bank changes the kind of message it sends? Yes — that's exactly what you're trying to handle: either a new message from a new type of bank, where you want to know very quickly that it's a bank and it's a debit and so on, or the same bank changing its template structure, where you don't want to have to manually reclassify everything. That's exactly the problem you're solving for.

Another question for Paul: these days you see a lot of messages where, say, you do one transaction, you get an OTP, and then for some reason it gets declined — how do you tackle those cases, figuring out whether the transaction went through or was declined? — You're saying the OTP is declined? — No: you get an OTP, you enter it incorrectly, the transaction gets declined, you do it again, so there are multiple messages with the same amount — a lot of noise. How do you manage that? — Yes, okay, that's another example of where the complexity comes in, and it's not just declined OTPs: a lot of types of messages are sent twice. You get "ding, ding" — the same exact transaction; the timestamps on the SMSes differ, but it's the same message, and you don't want to classify it twice as two debit messages. So again, you're modeling — tracking — at the level of the user. If you're classifying a debit message, you have rules like time windows: the probability of it being an independent message given how recently the previous message arrived. If it were the exact same message, it would be easy to catch; but they may send you two variants — one contains the balance, one doesn't — so it's not the same message, but it's basically the same transaction. How you handle that is, again, by layering more of these models on top of each other so you can correctly identify what's happening.

My question is about the quality of the data you collect from SMSes. Suppose you're building a balance sheet for a customer — say he gets a salary every month, but there are network issues and sometimes a message isn't delivered, or the person has multiple phones he might be using. How do you deal with the accuracy of the data, and provide confidence that what you've collected is good to use? — These are great questions; they really get at the trickiness of the system and why it's so important that each piece be modular, so you can go in and update it. In the early days, remember, we classified a sender by that six-character ID, and in the back of the system you have the user's phone number, with messages associated with it; you might just be tracking phone number to messages received. But then you want to go in and change the user identifier from the phone number to some other unique identifier that can have multiple phone numbers associated with it.
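A toy version of the duplicate-transaction heuristic described in that answer — same account, same amount, arriving within a short window, is probably one underlying transaction rather than two debits. The record shape and the five-minute threshold are mine, purely illustrative:

```python
from datetime import datetime, timedelta

def is_duplicate(prev, curr, window=timedelta(minutes=5)):
    """True if curr is likely a resend / variant of prev (e.g. one
    message with balance, one without) rather than an independent
    transaction."""
    return (prev["account"] == curr["account"]
            and prev["amount"] == curr["amount"]
            and timedelta(0) <= curr["ts"] - prev["ts"] <= window)

a = {"account": "XX1234", "amount": 300.0,
     "ts": datetime(2018, 3, 2, 14, 0, 5)}
b = {"account": "XX1234", "amount": 300.0,
     "ts": datetime(2018, 3, 2, 14, 0, 40)}
print(is_duplicate(a, b))  # True -> count one debit, not two
```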
That way you can realize it's the same person. More interestingly, you may not know that the person has multiple phones, but you want to be able to infer it from the messages themselves — to say, wait a second, we're seeing half the messages arriving on one number and half on another, but it's the same account, and if you merge that financial data you actually see continuity. Or you're seeing family members sharing an account, both of them transacting, something like that. So again, the problem is: can you go into the system and add that additional level of complexity without messing up everything around it, so that your final output is more granular and better defined?

Next question: say you have to train on all the SMSes in a distributed way — you don't have a way to pull SMSes from users, load them onto a server, and do this complex analysis. Is it possible to run both the training and the execution in a distributed way, on users' respective phones? — Run it on the phone? — Yes: run the training of your model, and the inference, so that although the training happens in a distributed way, the learning is collective. Is there any way to do that? — That's an interesting question. I'm not sure I want to venture a response, because that's not the way we approached it, but I think it would depend on the degree to which you could store models on the phone. You definitely need to get outside the system at some point so you have enough data to train, et cetera. But maybe you could deploy the model on the phone — I'm not familiar enough with Android development to know exactly how that would work, but I've heard of other use cases where you do entirely localized prediction. I don't know the exact mechanics of storing the model on the phone; usually it's an API call.
When you're calculating the credit rating of a person, do you factor in only the transactional message, or the service-provider message also? For example, if I book a ticket on Cleartrip but don't pay with my own card — I use a friend's card — I won't get the transaction message, only the service-provider message. Do you factor that in, or only look at cases where both a service-provider message and a transaction message arrive? — So the goal of the system is to incorporate that information; it's an iterative development process where you solve the basics first and then do more and more, and we were getting toward that piece — it's not fully integrated. A lot of times this matters for other reasons, too: it might not be that you used someone else's payment method, but simply that, for some historical reason — once your SDK is in place you can ensure you receive messages going forward, but the phone might have been off, whatever — you never received the message, and you have missing transactions. Say you never received the bank transaction, but you made a purchase on Amazon. You can match that purchase against the gap: if we look at balances across these messages, we see the balance going down, transaction after transaction, then a gap, and the next transaction shows a balance lower than expected — we're further down the slope, and we don't know where that drop happened. Now we see an Amazon transaction, and we're able to place it into the sequence. But that's the financial modeling you do once you've got structured data, where you can look at someone's account and so on. Before you can do that, you need the transactions from Amazon and the transactions from the bank extracted; then you can interpolate and bring them together. That's the next step in the process — this talk is purely about extracting the information; once you've extracted it, you can impute missing data, match patterns, all those things. There you might say: I'm going to scan my messages for the amount — not the timestamp, because you don't know when it happened — take a transaction before and after it, and see if you can cross-fit it, basically identify it as the same transaction.

And yes — this is another example of where you get a lot of false positives when classifying bank messages, because Paytm messages end up looking a lot like bank messages. That's why it's important, at the personal level, to identify the account-number information. You might have different formats of account numbers, but you model that — as I was saying earlier, you keep a library of account numbers for a particular user, each associated with a bank: your HDFC accounts, your Citibank accounts, and then your Paytm. That's really your money moving within your own system when you send it into Paytm or back into your bank.
So you need to be able to say, when you see a debit transaction, where is that debit going: is it going into an entity, or is it going into one of your own accounts? That's again the design of the model, where you're saying this is the problem we're solving, we insert this level of models into the overall system, and it runs in parallel or in sequence with all the other models.

For Mohan: I want to know, new content is always coming in, the shows keep changing; they will show some clip from next week's show, and the next week they will show some clip from the following week. So how do you keep training and updating your model when the content is always changing? Okay, let me just understand your question. You're saying the clips will keep getting updated, and are you asking how we monetize that or how we keep updating the model? Yeah. So we don't have a model, right? It's an unsupervised system, if that answers it. The way we extract stuff is exactly why this is the case of a system that requires an unsupervised approach: the data is changing very dynamically, you'll always have a new kind of thing, and training for something is going to take up more time. Unsupervised methods work out better because there are statistical patterns there that we make use of to bubble things up. But models are used to make the system more efficient. To auto-classify a clip as an ad versus a house promotion, you can use models that are trained from already observed data: once a clip is tagged, you can put it into the database and learn some features, so when something related to this ad comes up later on, you're able to classify it faster. But largely it is an unsupervised system.

I think he's been waiting for a long time there. Hello, question: do you plan to open this data to the public any time in the future? Open data, as in, just like Google Maps, Google data. Yeah, so Ad Breaks is an open system; we give you high-level analytics so you can observe whatever patterns are there. Some parts of it are copyright protected; for example, the clips that we observe are actually extracted from the broadcast feed, so they're mostly for private consumption. High-level statistics are absolutely available to the public. You're saying you want a full firehose of the data? Something more for developers, because you've got a nice set of data, I don't know if anyone else has it, and if you could open it for developers. We can actually consider that. Or if you want to do an internship or something like that, or join us on board, there's a lot of data we can... Thank you.

Hi, I just wanted a little bit more about the monetization model; importantly, are you targeting the end customers themselves? At least I couldn't catch it: for example, as a customer, I'm using this, what's in it for me? It's probably good for the broadcaster or the service providers, but what's in it for the end customer? Because people move the moment some ad comes: I switch to another channel when I get bored. How can you get it personalized for the users themselves, the people who watch the ads? Yeah, so it will trickle down; ultimately the experience for the end consumer will be much better,
and I'll explain how it'll happen. The reason why you see an ad 200 times is because they don't know who's seen it. There's this famous saying in advertising: 50% of all TV ad spend is wasted, but you don't know which 50% it is. And there's also a lot of psychological research in advertising which says that a brand needs to be present about five times; until then the recall is not there, but after five times it has diminishing returns and even negative brand recall, because you just hate it, you're like, why am I seeing this again? Notice that the first time you see an ad you've always liked it, because there's curiosity and there's an immense amount of creativity that goes into creating an ad. So creating a feedback system actually makes it more efficient, because advertisers will take fewer spots once they know they've had those five impressions; they don't want to spend more. So slowly you will see that the ad breaks will start shrinking if they know who's watching, and better ads will be able to reach people, who benefit from that. Now, the truth of the industry is you need ads to survive, because you're not going to bear the cost of content, right? Content is expensive, so the only way this can scale is through advertising, and good products also need to come in front of consumers. So better data insights help the entire industry, and ultimately, like GST, it's going to help the consumer, but it's meant for the industry; it streamlines the whole process. That is one.

The other is that we are also creating an X-Ray kind of experience for television, where you can search better. In fact, we are building systems where you can use voice and say, take me to an English movie that is not in an ad break right now. So in general it helps the experience there. Or you can set an alert and say, come back to this channel when the ad break is over. These are all interesting experiences even for consumers. We want to be careful about how we do this, because ads are the bread and butter of the industry; in fact, we had tested this for some time and then we were like, no, it's not the way to go, and we removed it. Questions?

Hi, I have a question for Paul. Looking at the content, I see it is kind of syntactically correct, so have you tried exploring anything on the syntactical parsing side of NLP, like part of speech or dependency parse trees? One example that you had: account one-two-three is debited, amount this much. 'Is debited' is the predicate, and you have a subject and an object. So have you tried leveraging that information somewhere, or have you explored that? Yeah, so there are definitely aspects of this where you could, I think, leverage NLP methods. There's some entity extraction. A lot of times the way I've seen entity extraction done, a named entity recognition type of thing, is where you're looking for any particular named entity. It's not necessarily that you're looking for this particular one over multiple iterations, where it's variable but not in a meaningful way, like everyone's account number. But in the sense of taking an approach where you say you're looking at part of speech and, like you said, you have your verb, 'debited', et cetera: we played with it a little bit, but it wasn't as generalizable to a lot of the other problems. So fitting that type of model was theoretically possible onto the system,
and that's the point: you can leverage different approaches within the same system. Any model at any level can be its own and use its data in its own way; it doesn't need to use the same data as the previous model. But in terms of sheer writing speed, for a data scientist, doing the same thing multiple times is easier than context switching and saying, now I need to understand a little bit what's happening with the NLP. I think there's some good potential to do even more than what we tried out, and that's why I encourage it: this is how you get a data set, and I would love to see different methods applied. I also think certain unsupervised methods could be really interesting, in a clustering kind of way; I think there are some problems there. But anyway, there's definitely more that could be done, and that's what I was trying to say: a lot of times people see this and say it's commoditized, it's a solved problem. But just the tiny bit at the top is solved; doing it efficiently and effectively across all the different use cases is where it gets interesting. So I think this type of NLP is definitely a candidate for further exploration. Okay, thank you. Questions? Okay, I don't see any more questions. Thank you, Paul. Thank you, Bharat. That was a very great presentation.

Ladies and gentlemen, we will break for lunch and return here exactly at 1:45. Meanwhile, you can pay for the food counters using the HasGeek counters; we accept card and cash at the HasGeek counter but not at the food counters. From 1:45 to 2:30 p.m. we are going to have an OTR on failure stories in machine learning, led by Bargawa S, if you have any failure stories to share.

Check... Hello, hello, people, welcome back from lunch; hope you had a good one. We will be starting with Manas Ranjan Kerr's talk about building serverless architecture for deep learning and NLP. He works for a company called Episodes, and we'll be starting with the talk shortly. We have distributed feedback forms for the evening session; they are different from the ones distributed to you in the morning, which were for all the morning talks. The evening forms are for the talks starting after lunch till the end of the day, so you would want to go ahead and fill the feedback forms; they help us make the conference better for you.

Hi, good afternoon. Before I start off, I'll introduce myself. I'm Manas and I work for a company called Episodes. I lead the data science and NLP practice, and we have been working on building clinical intelligence systems for the past few months. This talk is mostly going to be about how we are building solutions which are cost efficient, in a serverless manner. This is the brief agenda, and I'll just kind of breeze through it: we'll talk about the problems, what the challenges are, some of the architectural paradigms that we considered and then tried out, and then some caveats and the final impact that we had through the architecture that we deployed. Now, a brief about the company. Episodes is a healthcare risk management firm, which means that we take data from insurance and hospital providers in the US and make sure that the clients are reimbursed properly by the US government. That means we have to analyze entire medical discharge summaries and claim records and understand what the diseases are
that have been identified and whether or not they are reimbursable. The context is that Episodes started focusing on ML and NLP specifically because there was pricing pressure building in the market; everyone wants automation and everyone wants the cheapest price possible, and that is where we come in. We want to build a scalable information extraction engine using NLP and machine learning.

Now there are multiple challenges. The first challenge, of course, is that you have to have a scalable platform atop which all these data science products will lie. But the bigger challenge is that, since we are dealing with patient care data, and especially since we fall under the US healthcare market, we have to maintain HIPAA compliance, which means that data has to be secured both at rest and in transit. That makes it even more challenging, because you can't use all the services, you can't use all the cloud providers, and if you were to go use bare metal you have to ensure that compliance and audit checks are done on a regular basis. So this is the first primary challenge. The second challenge is that we have to be cost efficient. We are paid per chart; we are not paid per disease, which means that at the end of the day I need to make sure that my company is profitable. This is something that I keep saying again and again to many of the data scientists and peers out there: the job of a good data scientist, at the end of the day, is to ensure either of these two outcomes. You can either make your company money or you can save your company money. These are the only two outcomes, and it is your job as a data scientist to make your company look good. For us, in that sense, the KPI is the cost per chart that comes out after the NLP has been done, and that has to be kept as low as possible. That is where the entire philosophy on which we have built our architecture comes in.

Now, the first three pieces, scalability, fault tolerance, cost effectiveness and lean architecture, are things that anyone who is building any kind of architecture will focus on, so let's not dwell on those. But we are also talking about immutable configurations, and we are also talking about self-healing: what happens if a region goes down, what happens if an availability zone goes down while the process is happening. You have to make sure that the entire process is self-healing in nature and that we are not repeating and setting up configurations again and again, and that is where tools like Ansible and Docker come in. And this last piece, I am sure, is a slightly controversial piece: when I started off, and when I started building my practice and hiring people for it, I did not want to hire people who are pure DevOps or pure ML people. I wanted to build an MLOps company, or a practice, where a person can create his own algorithm and deploy it as well. So what I wanted to create was an MLOps team. The obvious solution for that is serverless, because at the end of the day, at least this is my belief, no ops is the best DevOps. If I don't have to manage servers on a regular basis, apart from some maintenance, I think I am doing the best DevOps. There are multiple tools out there in the market nowadays which are quite popular, Docker, Ansible, AWS cloud scripts, CloudWatch and so on, which will give you way more control over your entire architecture the way you are trying to build it. And the biggest plus point for me was that
I did not have any servers to manage. Anyone who has been working on servers will tell you that the moment a server goes down for even a minute, the amount of headache it entails... and that is something I actively wanted to avoid.

So what are the implementation points? The first piece was that I had to make sure that my entire configuration is immutable. When I develop algorithms on my system, I should be able to test them on a real-time basis, so I am using PyCharm as my IDE and Docker as my remote interpreter. That way I am ensuring that my algorithms are tested in real time and there is no problem when I deploy them on the servers. I am using AWS Lambda, which is the serverless service of AWS, mostly for some amount of ETL, data manipulation and NLP tasks as well. We are using Boto3 for generating security credentials on the fly; we do not want those AWS secret keys written in our code, so we generate security credentials on the fly, and we are using Amazon STS for that. We are using Ansible, which is doing about 80% of our heavy lifting: triggering all the manual tasks, SSHing into servers, setting them up, pulling data, pulling the EBS volumes, encrypting those disks and so on, and of course the final data pull and push to the S3 buckets. And the final piece is the NLP part, where the algorithms are taken care of. These are some of the libraries I mentioned; some were used earlier and now we have moved on to other libraries, for example we were using TensorFlow and now we are moving to PyTorch. So this is the broad stack and the process flow that we have for our entire systems.

Now, what is our entire stack? To give you a brief background, there are always four important metrics in any data science product: accuracy, precision, recall and F scores. Depending on your use case, some of these metrics may be important; accuracy is an overrated metric, I believe, but in our use case recall is a very important metric, because for us a false negative is losing money. If I don't identify a disease properly, I am losing money; hence for me recall is more important. So the moment I have new data coming in, which is essentially my training data, or a new code push, for example I pushed some new changes to my algorithm, I trigger this entire piece where Lambda triggers a set of servers and my training happens. The results are then saved and compared against the prior results, and if my recall is better, I deploy the new model; otherwise I don't. I then log my model results and do the necessary logging for later. So this is the ML deploy pipeline. It is not entirely AutoML per se, but it is automating the ML model deployment. This is the first piece of serverless that we have done, and for this most of the pieces we are using are, again, Lambda, Boto3 and Ansible.

But our most important pipeline is the document processing pipeline. The documents are uploaded to S3 by the client or by our own in-house people, and we trigger Lambda after that, essentially saying: Lambda, hey, there was an S3 upload one hour back, or there were a thousand files uploaded one hour back, why don't we start processing? So I trigger a Lambda which will trigger other servers. It then launches into a master-worker configuration where the data and tasks are pulled from SQS. The reason we have a master-worker configuration is that the data processing has to be done in a private subnet, and we need to SSH into the workers, so the master is essentially like a jump box for us.
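As a rough illustration of the STS point above, generating short-lived credentials on the fly rather than hard-coding secret keys, here is a minimal Boto3 sketch; the role ARN and session name are hypothetical:

```python
import boto3

sts = boto3.client("sts")

# Ask STS for temporary credentials by assuming a role; no long-lived
# secret key needs to be stored in the code or on the worker.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/etl-worker",  # hypothetical role
    RoleSessionName="nlp-etl-session",
    DurationSeconds=3600,  # credentials expire after one hour
)
creds = response["Credentials"]

# Use the temporary credentials for the actual work, e.g. an S3 client.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```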
Once the data is processed and everything is done, we push it back to S3, and then we delete all the servers. And there is a secure API that we have created to interact with the results.

Now, these are the good parts, but there are tradeoffs, and the tradeoff is very simple, and it holds across any serverless platform that you will come across. You have Google Cloud Functions, which is a competitor to AWS Lambda, and you have AWS Lambda on the other hand. AWS Lambda gives you a maximum memory allowance of one and a half GB, and the minimum is 128 MB, and I'm pretty sure Google Cloud Functions has similar limits. Atop that, AWS Lambda only allows you to run one Lambda for no more than five minutes, which means there will always be this tradeoff between your heavier and lighter workloads. The moment you want to process heavier workloads using AWS Lambda, it will cost you more money, and it's going to be way, way more expensive than having a server 24/7. So this is the tradeoff that we have made: we have only ported some of the ETL tasks, and I will mention them in the next slides, to AWS Lambda. That keeps our memory low; we are operating at 128 MB only, and it's costing us around 35 dollars for around a million invocations per month. That's the price point. However, it would run into 3,000 to 4,000 dollars if I did the same number of invocations with more than 500 MB of memory. So this is the tradeoff that you have to keep in mind whenever you are trying to port or build any serverless solution; this tradeoff is very real.

So what are we using Lambdas, or any serverless solutions, for, mostly? We are using them for minor ETL tasks; for example, we get data in an XML format and have to convert it into text formats for ingestion into the NLP engine. We are using them for SSHing into the master in the public subnet, to set up all the servers and run the Ansible scripts. There is also what we call in-house a serverless DB API; I am not sure if that is the correct technical term. We have our data finally stored as JSON in S3, and the moment my Salesforce application wants to query the results of a process, it can query this API. What the API does is launch a Lambda, go to that S3 bucket and retrieve the necessary data. So it is a serverless DB API backed by S3. And there are minor cron jobs for monitoring and logging that we use Lambda for. So these are the few tasks for which AWS Lambda is used.

Now, it may not be for you, and this I will put forth in pretty absolute terms today. There are many people who explore serverless solutions and then back off because they feel serverless solutions are pretty expensive. It will be expensive if you want to run a proper machine learning model prediction on Lambda, which is not going to happen, because 1.5 GB is not feasible for some of the heavier NLP models. So if you have memory-intensive workloads, serverless is not for you; you don't want to go serverless for a memory-intensive workload. If you have ultra real-time response requirements, for example you want sub-10-millisecond or sub-50-millisecond response times, you don't want to go serverless either, because even though we are getting around 100 to 200 millisecond responses, even with the API and the launching of the cron jobs, that may not necessarily be feasible for other applications.
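A minimal sketch of the "serverless DB API" idea described above: an API-fronted Lambda that fetches pre-computed JSON results from S3. The bucket name, key scheme and event shape (API Gateway-style path parameters) are assumptions for illustration, not the speaker's actual implementation:

```python
import json
import boto3

# Client created outside the handler so warm invocations reuse it.
s3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical route: GET /charts/{chart_id} via API Gateway.
    chart_id = event["pathParameters"]["chart_id"]
    obj = s3.get_object(Bucket="nlp-results", Key="charts/%s.json" % chart_id)
    result = json.loads(obj["Body"].read())
    return {"statusCode": 200, "body": json.dumps(result)}
```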
And of course, if you have too many library dependencies. We are using built-in functions of Python, hence we don't have to create a package and upload it. But if you have too many library dependencies, with five or ten external libraries, then you have to package them into a deployment file and upload it, and again there is a limit of 250 MB which you can upload. So there are many caveats associated with it. And of course, what serverless entails is that you don't have tight control over monitoring, so if you need very granular monitoring, it's not for you.

So, on to the challenges. These were the two broad challenges that we faced during the entire serverless paradigm: what do I do if something fails, and how do I monitor each and every step? There are around maybe 30 steps in my Ansible scripts; how do I ensure that I'm monitoring each and every one, because I need to do that, and how do I ensure that I'm logging and monitoring my ML scripts? So this is what we did. For fault tolerance, or self-healing per se, we decoupled it using Amazon SQS. You can choose your own poison in terms of messaging queues and decouple these tasks. What we do is, the moment a task is launched, we don't delete it from the queue unless and until the result is pushed back to S3 or to our database. That way we ensure the self-healing properties, and after that we also launch Lambdas to re-initiate failed tasks and run through the architecture again. For monitoring and alerts, we are using CloudWatch, and we are using custom logging from our Ansible scripts and our Python scripts to push metrics back to S3 buckets and CloudTrail. While this is not entirely foolproof, we are surely able to collect and analyze our entire logs, and that's the way we are currently doing it. And as I mentioned in my last slide, if you are looking at very granular monitoring of a system, you are better off porting only some of the minor functions to serverless solutions; for the major parts, you are better off not doing it.

So this is the impact that we have had by launching the servers in a serverless fashion. To process around a million charts a month, and that's the volume I have showcased here, which is roughly around 500 GB to a terabyte of data per month, it would have cost me around 20 to 25 thousand dollars per month to process that much data and send it back to our databases. But with the serverless solution, we are able to make it roughly a tenth of that: the entire cost of this architecture is around 3,000 dollars if we go with on-demand instances and around 500 dollars if we go with spot instances. The reason there is a range is that you won't always get a spot instance; the demand may be high and you may not be allotted your spot instance, so you fall back on on-demand. That translates to roughly 20 paise INR per chart, and a chart can be anywhere from 30 to 50 pages. So this is the cost efficiency we are able to bring in by focusing on a pure serverless architecture: not running servers 24/7, ensuring that the most important tasks are ported. And the best part is that we have ensured the entire architecture is a secure environment, so the data is secure both at rest and in transit, and that helps, because AWS has many HIPAA-compliant services. That was a major win for us.
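Going back to the SQS decoupling described above, here is a sketch of the receive-process-then-delete pattern: if the worker dies mid-task, the message is never deleted, so it reappears after the visibility timeout and another worker retries it. The queue URL and the process_chart helper are hypothetical stand-ins:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/charts"  # hypothetical

def process_chart(body):
    pass  # placeholder: run NLP on the chart and push results to S3

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        process_chart(msg["Body"])
        # Delete only after the result is safely pushed; otherwise the
        # message becomes visible again and the task is retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```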
So this kind of wraps up what I had to share in terms of the work we have done at Episodes. You can surely reach out to me at these coordinates, and let me know if you have any more questions. That's it. Questions? I see one there.

A very basic one, I'm sorry, but is your architecture entirely serverless, or are you actually using Ansible to deploy stuff on servers? Okay, so serverless is what triggers our entire architecture. Since it's a pure ML kind of product that we are building, we can't deploy the entire solution on serverless services. The way I look at it, serverless means I don't have to manage servers; that's my definition of serverless. If I'm able to launch my architecture on an on-demand basis and not run servers 24/7, I think I'll save lots of headaches and money for my company. And we are using Lambda to launch those servers, as well as for some of the ETL tasks and the cron jobs that would typically run on our server. Got it, thanks. And you also made a pretty strong comment about Lambdas, serverless and monitoring; I think it's pretty straightforward to have everything on CloudWatch and have alerts on them, so what is the gap? Okay, so when you're on AWS Lambda, you will have the generated CloudWatch logs; however, not all metrics are generated. There are the free metrics that CloudWatch gives you, and there are the custom CloudWatch metrics that you have to generate on your own, and then you have to decide what you want to monitor in a serverless solution. Do I want to monitor my memory requirements? Of course not, I've capped them. But I would want to monitor the time it takes to go from one function to another and see where my optimization opportunities lie, and requirements like where my task is failing. It actually happened a couple of months back that our run was failing at one specific Ansible task and we were not able to figure it out. That's where granular logging is very important, and that's where the serverless gap lies.

Question: you mentioned HIPAA compliance, so I wanted to know, if you're using Lambdas, are they HIPAA compliant? No, they don't fall under the HIPAA BAA agreements, but we are not processing or storing any data on Lambda. Atop that, we are even encrypting the file names while moving files from one S3 bucket to another, because sometimes PHI data or other sensitive data can also be written in the file names. So on Lambda we are not doing any data processing itself. And the ETL tasks that you are looking at, just the conversions from one place to another, to ensure that we are not being non-HIPAA-compliant, we are also doing on private subnets. Okay, got it. Thank you. So it's running on our own subnets.

I'm sure we have many more questions for Manas, but we are running out of time; kindly take the questions offline. Thank you. Thanks, Manas, for your presentation. Is that okay? Do you want... Hello. We have Krishna Priya S from Mad Street Den talking about plumbing data pipelines. Also, within the next five minutes, the OTR session about Spark use cases and challenges in production will start in room 01, for people who want to attend.

A very good afternoon to all of you present here. I'm Krishna Priya from Mad Street Den, and the topic of the talk is plumbing data science pipelines. So, data science, artificial intelligence, machine learning: these are terms and jargon we hear quite often.
Why is building an application which falls into any of these categories a challenge? The real challenge is the data itself, because data has to be aggregated and made compatible for every phase throughout the application. Every phase has to be chained together, and the application has to be real-time and scalable. Let me explain this with a couple of examples. Airbnb, which is a popular alternative for rentals and accommodation worldwide, is using the Airflow workflow management system. They had a couple of issues they had to resolve: their tasks were mission critical, they had a scheduling and sequencing problem, and their processes were evolving. They needed a scalable, robust architecture that would resolve all these issues, and that's what Airflow did for them. The next example is something we've all heard of: providing recommendations to the user. It might appear really simple, but when I say recommendations, we give recommendations based on the user's browsing history. The history is not just what the user browsed yesterday; the history is also real-time. When I say real-time history, it means what the user has browsed pertaining to a session, which is near real-time, and we provide recommendations based on that. So how do you build a pipeline for that application? And the next one is a popular example: every one of us who is building an application has to take care of logging, because only if you log will you get to know how your application has performed, what has gone right, what has gone wrong, and how the performance has been over a period of time. So for this talk I'm going to take up a small use case of logging to explain how we resolved the problems of logging by building a pipeline end to end, mostly the do's and don'ts of plumbing the pipeline.

Okay, the plumbing story. You can split it up into three major parts. The first part is the preparation phase. As I said, you need to prepare the data: figure out what problem you're going to solve, ask questions, collect and organize the data. In the next phase you will be writing algorithms, applying models on the data, processing the data, and you will come up with an analysis. This analysis is applied in the final phase, where you provide recommendations, send reports or do some visualization.

The tech stack for the day is going to be Celery, RabbitMQ, Redis and the ELK stack. Why Celery? Celery is robust and scalable, and it helps you build a real-time application end to end. RabbitMQ is a queuing platform, and it is also a broker, so it's not just another queuing platform. RabbitMQ also takes care of managing the queue: for example, if the subscriber has not acknowledged a particular message, RabbitMQ can send the message again and then find out what has gone wrong. It literally manages the queuing; that's why it's a broker. And the ELK stack is a combination of aggregate, analyze and visualize: you have Logstash, which helps you parse the logs, you have Elasticsearch, where you actually put and analyze the data, and Kibana helps you visualize what you put in. So this is a brief of the use case I'm going to explain: you have the logs, they are passed through RabbitMQ, RabbitMQ sends them to the Celery workers, and finally they go to Elasticsearch. And this is the ETL workflow in a little more detail.
So we have CloudFront, and the CloudFront logs land in an S3 bucket, which is the log store; from there they go through SQS, you poll the messages from SQS, and RabbitMQ is the broker that manages sending them to the Celery workers. Why Redis comes in there, I'll explain in a bit. Then finally they are pushed to Elasticsearch. The Kinesis and Redshift here are backups in case the push to Elasticsearch does not work; we are not going to explain that here, we will stop with Elasticsearch. I split the use case into three simple parts: the first part is the polling of SQS, the second part is processing the logs, which the Celery workers will be doing, and the third part is the push to Elasticsearch.

A little bit more about Celery. In Celery you can handle both compute-optimized and memory-optimized tasks and assign workers to them, and you can completely use all the cores: you can maximize CPU utilization to even 100% while keeping the memory really, really low. It is heavily parallelized, and of course it's asynchronous.

Okay, so let's get to the actual methods: the poll and the push. The polling is actually going to poll from the SQS queue. There is something called max_retries that I have mentioned here, and similarly in the push there is something called rate_limit, and I will explain both with use cases. So, max_retries: in case the connection to SQS is lost on the polling side, or some failure has happened on the SQS side, you have to handle the failure. What max_retries can do is back off for a particular amount of time and retry again. Based on the backoff time and how many times it retries, you get enough time to fix the failure, and once it's fixed before the last retry, the application goes on.

Next is the rate limit. How do you find the rate limit? RabbitMQ has a beautiful UI that you can enable from the command prompt (rabbitmq-plugins enable rabbitmq_management), and you can see how many tasks per second, how many messages per second, RabbitMQ is actually doing. In RabbitMQ, because we have two methods, the polling and the pushing, we have two queues: the poll queue and the ELK queue. The rate limit problem in this use case will usually occur on the ELK side. For example, you are polling a lot of messages, but Elasticsearch, based on the cluster size, is not able to ingest at the rate of the polling. You need to rate limit it so that your messages are not lost; that's where the rate limit comes into place. This is more of a producer-consumer problem: you are producing at a very high rate, but the consumer is not able to consume messages at that rate. So find out how RabbitMQ is performing; as of now it is performing at 73 per second. When I talk about Celery workers, imagine there are 10 workers and every worker is doing 73 per second, which means almost 700 tasks per second are being pushed to Elasticsearch. If Elasticsearch is not able to handle that, you rate limit it. What I have done is rate limit it to 20, which means, with 10 workers, summing it up you will only have 200 tasks per second. Maybe Elasticsearch will be able to handle this. It is probably a conservative choice, and it will take some time for you to figure out the balance between the producer and the consumer and come up with the proper rate limit.
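Both knobs named here, max_retries on the polling side and rate_limit on the push side, are standard Celery task options. A hedged sketch, with hypothetical placeholder helpers standing in for the real polling and indexing code:

```python
from celery import Celery

app = Celery("pipeline", broker="amqp://localhost")  # RabbitMQ as the broker

def fetch_messages():
    return []   # placeholder: poll SQS here

def index_document(doc):
    pass        # placeholder: index into Elasticsearch here

# Polling task: back off and retry when the SQS connection fails.
@app.task(bind=True, max_retries=5, default_retry_delay=30)
def poll_sqs(self):
    try:
        return fetch_messages()
    except ConnectionError as exc:
        # Wait 30 seconds and retry, up to 5 times, giving the failure
        # time to be fixed before the application gives up.
        raise self.retry(exc=exc)

# Push task: capped at 20 tasks per second per worker, so 10 workers push
# at most around 200 documents per second into Elasticsearch.
@app.task(rate_limit="20/s")
def push_to_elk(doc):
    index_document(doc)
```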
So we saw Redis coming in there in the ETL workflow, right? Redis is an in-memory DB where you can store some amount of information in memory. I'll explain a use case for this. Say the messages on SQS drop to zero, there are no messages; you don't want the polling to happen against an empty queue, right? Instead, you can save the attributes of the queue in a session on Redis. You regularly fetch the attributes of the queue, save them in the Redis session, and the workers can go and read the session information. Based on that, you will have a runner that manages the entire thing, and the runner can be dynamically throttled: if there are no messages on the queue, you put a lot of sleep time on the runner and try polling again after some time, and if there are a lot of messages on the SQS queue, you decrease the sleep time. You throttle the runner based on that; that's where Redis comes into play.

But how do you refresh Redis? That's where the Celery beat worker comes in. What the beat worker does is, at regular intervals that you mention in the program, go and refresh the sessions on Redis; this is where you specify how often you want the beat worker to do the refresh.

We were talking about workers and the producer-consumer problem. Once you've figured out how many poll workers and how many ELK workers the application can handle, when you want to increase the number of workers you should do it in blocks, because there is a balance between producer and consumer. You have one block of poll workers and ELK workers handling it, so when you're adding more workers, you add another block, and not randomly add ELK workers and poll workers, so that your application always stays in balance. Celery has a really beautiful way of letting you know that a new worker has actually joined the party, so you can inspect the workers and find out how they have been performing. Then there is something called autoscale. You can set a worker to autoscale if a particular task needs sudden throttling, if you need the workers to scale up during peak times; that's where autoscale comes into place. But remember the rate limit: when you're autoscaling, you have to keep in mind that the maximum autoscale number you've mentioned, the rate limit and the number of concurrent workers should all be in balance, so that the consumer side still does not have a problem and you don't lose any data.

htop is your best buddy. After doing all this, go to htop and see how much of the CPU and how much of the memory you've utilized. An ideal situation is where the CPU can be even 100%, we've used only 70 to 80% of it, and the memory is really low. But in case you're using the CPU completely and your memory is really, really high, then there is something you have to change in the code.
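The beat-refreshes-Redis arrangement described above might look roughly like the sketch below; the queue URL, key names and the 30-second interval are assumptions:

```python
import boto3
import redis
from celery import Celery

app = Celery("pipeline", broker="amqp://localhost")
sqs = boto3.client("sqs")
cache = redis.Redis()
QUEUE_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/logs"  # hypothetical

# Beat schedule: refresh the cached queue depth every 30 seconds so the
# runner can throttle its sleep time without hitting SQS on every poll.
app.conf.beat_schedule = {
    "refresh-sqs-attributes": {
        "task": "pipeline.refresh_queue_stats",
        "schedule": 30.0,
    },
}

@app.task(name="pipeline.refresh_queue_stats")
def refresh_queue_stats():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    cache.set("sqs:backlog", attrs["Attributes"]["ApproximateNumberOfMessages"])
```

Autoscaling within the limits discussed above is then a worker-side flag, for example celery -A pipeline worker --autoscale=10,3, which caps the pool at ten processes and keeps a minimum of three.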
So how do you do the memory profiling? Before I get into that, let me extend a little bit on the ELK stack. The ELK stack, as we saw, is a combination of Elasticsearch, Logstash and Kibana. What Logstash does is help with the parsing: you can specify a pattern by which it parses, and it then pushes to Elasticsearch. We're not using Logstash here; it's out of the scope of the talk, but it's fairly simple to set up. After you push to Elasticsearch, you visualize it on Kibana. On a Kibana dashboard you don't really have to know how an Elasticsearch query is done: if you know how you want to filter your data and how you want to visualize it, you can just apply the filters right away. What you want on the X-axis, whether you want to see the count, how you want to aggregate: you have a fair idea about it, you can just set a range for the data you want to see and visualize it automatically. In case you're familiar with Elasticsearch queries, you can do that too; there is a tab called Dev Tools where you can execute the query and get results right away.

So, memory profiling. There are a couple of things to look at when you're doing memory profiling. First of all, you have to clean up the code. When I say clean up the code: look for cyclic references in your code, consider lists versus generators, and if you have objects that are really large, occupying and accumulating a lot of memory, take a look at all of that. And we had Redis sessions, right? The Celery workers will maintain the state of the sessions in memory, so you have to take a look at that and monitor it too. And connection objects: we have two connection objects in this use case, an SQS connection object and an Elasticsearch connection object. It's best if you keep the connection objects outside the Celery task, because every time you create a task, you create a connection object and it sits in memory. Instead you can keep these connection objects in connection pools, which will help you in profiling the memory.

A summary of all this: we built a real-time, scalable streaming solution. It helped us in handling customers in real time at high demand, and the search history, as I said, history pertaining to a session, which was also near real-time, and we were able to do in-memory scaling, lower latency, everything. So to put it all in a summary, some pointers on what we learned about perfect plumbing: find out which parts of your program are compute optimized and which are memory optimized, and assign separate workers for the compute- and memory-optimized tasks. Then find out the producer-consumer balance, and keep monitoring on htop to see whether your memory is low and how your computation is going. Finally, RabbitMQ: it is best to keep RabbitMQ on a separate instance, not on the same instance where you are actually running the Celery tasks. That's about it. Any questions?

I see one there. Hi, thanks for the talk. My question was, I was just curious why there are two queuing services, SQS and RabbitMQ, serving, I don't know, similar purposes. Could you possibly have the workers directly polling SQS and then pushing to Elasticsearch, or maybe push the logs directly into RabbitMQ? I'm just wondering why you had two sorts of queues. Okay, so that was the use case that we had: the messages were coming from SQS for us, so we didn't really play around with the SQS part. The use case started only from SQS: how do we get the messages from SQS, push them to Elasticsearch and visualize them. So we didn't really think of resolving the SQS part of it. Also, Logstash is blocking; it does not have an asynchronous call, so you needed something like Celery to make it asynchronous.
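Before the next question, the connection-object advice from the memory-profiling discussion above, keeping clients at module level rather than inside the task body, looks like this in practice; the endpoint and index name are illustrative:

```python
from celery import Celery
from elasticsearch import Elasticsearch

app = Celery("pipeline", broker="amqp://localhost")

# Created once at import time: every task invocation on this worker reuses
# the same client (and its connection pool) instead of allocating a fresh
# connection object that sits in memory per task.
es = Elasticsearch(["http://localhost:9200"])

@app.task(rate_limit="20/s")
def push_to_elk(doc):
    es.index(index="logs", body=doc)
```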
Next question. Yeah, so my question is, you are using RabbitMQ and the ELK stack, right? Did you consider using anything else, like something like Kafka in place of RabbitMQ, and then Spark instead of ELK? Yeah, so this is probably a medium-scale application, and what RabbitMQ does is help you do it in real time: as and when a message is published, you get it on the other side. Kafka would probably handle it at a much higher scale, and setting it up and everything is going to require a lot more work. So for an application as simple as this, this tech stack helped us do it much more easily. Questions? I guess that's it. Thank you. Thank you, Krishna Priya.

A reminder again for all of you to fill the feedback forms; they are placed on your seats. They are different from the ones distributed in the morning: the morning feedback forms were for the talks from morning until lunch, and these are for the talks from lunch until the end of the event. So please go ahead and give us your feedback; it helps us make the conferences better. The OTR session on Spark use cases and challenges in production is about to start in room 01; it starts at 2:30. Coming up next is Vimal Sharma from Hortonworks, who will be talking about governance using Apache Atlas: why and how.

Good afternoon, everyone. I am Vimal Sharma, a software engineer at Hortonworks, and I will be talking today about Apache Atlas. A quick show of hands: how many of you have heard about Apache Hadoop, or at least used it? So, the majority. And how many of you have at least heard about Apache Atlas? Okay, that's good to know. This is probably a good audience for me to introduce Apache Atlas and show how it works. With respect to Apache Atlas, I am a member of the project management committee and a committer on the project. These are some details about the project: development on Apache Atlas started in late 2014, and it was incubated into Apache in May 2015. Developers from organizations like IBM, Hortonworks, Aetna, Merck and Target are involved in the evolution of the project. There have been three releases in the past year: the 0.7 release in July 2016, 0.7.1 in January this year, and the 0.8 release in March this year. And Atlas graduated to a top-level Apache project very recently, last month.

A high-level overview of the project: Apache Atlas is basically a governance and metadata management framework which was initially built for the Hadoop ecosystem. It is generic and flexible enough that any arbitrary component can be modeled and its metadata captured. We can have two types of metadata. The first kind is that of the data asset, for instance a Hive table or an HBase column family. The second kind is that of the processes occurring within a component or across components. So if there is a CTAS query in Hive, that can be captured in Atlas; plus, if there is a YARN job which picks up some data from HBase, does some processing and then dumps it into Kafka, these kinds of events can also be captured. Apart from metadata capture and visualization, Atlas also provides the facility to classify metadata entities using tags, which can be used in tandem with Ranger to enforce security policies. Atlas has built-in support for many popular Hadoop components like Hive, Storm and Sqoop, and its architecture is
flexible so that any component can be modeled and captured so let's look into the use cases for the governance problem the first use case will let be of the extract transform load pipelines so if I am the owner of a current ETL pipeline so how do I download any upstream failure so let's say my source dataset is dependent on an ETL pipeline and which fails so there is some data quality issue or data itself is not there so how do I debug this kind of issues plus if there is any downstream ETL pipeline downstream dataset which derives from my computer dataset if there is a failure in my pipeline how do I alert the downstream owners so that they can take proper measures to correct the data quality issues so it seems a visual lineage and a record of such kind of dependencies would be very helpful and Atlas does exactly that second use case is that of the redundant processing so if I am the developer and I want to compute some information based on the source dataset so can I avoid the redundant processing so if I have some mechanism to know whether that information is already available in one of the datasets can I avoid the redundant processing so we will see that Atlas we can exploit the Atlas classification feature to know whether the information which we need is already there third use case is that of the compliance and security from a business point of view so if we want to restrict access to sensitive information to some set of users how do I enforce these policies so the standard solution would be to apply a ranger policies on these datasets but that would be an overkill I mean if there are datasets across components which have sensitive information like a high table can have sensitive information and HDFS file can have sensitive information so how do I make sure that there is a single policy which we can apply and the same kind of rules will apply across components fourth use case is that of the cluster admin very often cluster admins have this use case of cleaning up their cluster from dormant and unused datasets so it seems if there is some mechanism to come up with a relevance score for a dataset so low relevance datasets can be archived or deleted from the dataset all together so using Atlas ETL using Atlas lineage diagrams and classification features cluster admins can come up with some relevance score based on which they can take decision to archive or keep the dataset this is the Atlas architecture at the core of Atlas is its type system which we will look into more detail in further slides apart from the type system there is the ingestion and export engine which is responsible for ingesting the metadata events and entities as well as the export of Atlas models the metadata as a graph and it uses the titan library to do this the metadata is actually stored in hbase and there is a solar based indexing engine which is used to improve the search capability over the metadata as I mentioned earlier Atlas has built in support for many components like high scoop and storm and metadata events in these components are communicated to Atlas via Kafka queue and apart from this Atlas publishes the tag addition and deletion events to an entity to another Kafka queue which Apache ranger is a subscriber and it can use the tag addition and deletion event to enforce the policies which are defined on that particular tag so there is a high level REST API which can be leveraged to do all these expressions like ingestion, export as well as registration of type system and search cross component lineage 
Cross-component lineage is one of the central features of Apache Atlas. Lineage is a visual diagram of the dependencies among datasets. If we have created an external table with some HDFS path, the lineage will look like the one shown in the diagram on the Atlas UI. Components like Hive or HDFS can have their own metadata store, and the metadata store logs can be used to investigate what events happen within that component, but where Atlas comes into the picture is the events which are cross-component. Let's say there is one Spark job which picks up some data from HDFS, does some processing and then dumps it into a Kafka topic. In this case, neither the Spark logs nor the HDFS logs would be helpful in connecting this event. So Atlas can be leveraged to define the model for several components and capture all these events which occur across components. Ranger is a listener on the Kafka topic to which Atlas publishes the tag addition and deletion events, and if we have defined any security policies on top of those tags, they will start applying on that particular data asset. So we have attribute-based policies rather than asset-based policies: we don't need to define policies individually for each data asset; we define policies based on the tag, and whenever we attach the tag to a data asset in the Atlas UI, the corresponding policies start getting applied.

The type system is central to Atlas; it's the skeleton of the metadata we want to store in Atlas. The type system is analogous to object-oriented programming, in the sense that in OOP we have the notion of a class, which can have attributes and supertypes. Similarly, a type can have attributes, a unique name and a set of supertypes. An instance of a class is termed an object in OOP, and an instance of a type in Atlas is called an entity in Atlas terminology. Attributes can have several properties: an attribute can be mandatory or optional; it can be unique, identifying the entity uniquely across the Atlas repository; it can be composite, meaning the lifetime of the attribute is controlled by the parent entity. An example would be Hive columns: Hive columns don't have any identity outside of the Hive table in which they reside, so if we delete the Hive table, the corresponding columns should be deleted as well. Reverse reference is another property, mainly used as a back-pointer reference to the enclosing entity. Our Hive table example applies here as well: each column will have a back reference to the enclosing Hive table.

Atlas has a bunch of base types that are predefined for you, and those are bootstrapped whenever the Atlas server is started. We will go through some of them. Referenceable is the one which has a mandatory attribute named qualifiedName, a unique attribute which identifies a metadata entity uniquely across the Atlas store. Asset is used to identify entities which have some notion of ownership; it has a mandatory attribute name and optional attributes owner and description. DataSet derives from Referenceable and Asset, and it is responsible for the entities which are actually stored in Atlas; a Hive table, for instance, will be an instance of DataSet. Process is another one which also inherits from Referenceable and Asset, and it has optional attributes inputs and outputs. Process is responsible for tracking lineage in the Atlas store: when we navigate to the Atlas UI, we will see the set of input datasets which are used and the output datasets which are derived.
We will be modeling the Spark DataFrame type in our demo, so a little bit of introduction about Spark. In Spark there is this concept of RDDs, which are the basic unit of execution in the Spark framework. DataFrames are a special type of RDD which have some notion of relations among their data: maybe a JSON in which there is a structure to the data, or a Hive table where the information is related in some sense. So let's try to model the Spark DataFrame type. As outlined in the slide above, the Spark DataFrame type will inherit from the DataSet type, and it will have additional attributes: source will be a mandatory attribute indicating from which source the data is derived; destination and columns will be the optional set of attributes. DataFrame column is another type which will also inherit from DataSet, and it will have the mandatory attributes type, indicating the kind of data there, whether it's a string type or an integer type, and dataframe, which is the reference to the parent DataFrame, plus an optional comment.

As I mentioned earlier, Atlas stores the metadata entities as a graph. This is a snapshot of the type we have just defined, and the entities. Vertex one indicates the Spark DataFrame type; vertex three indicates the DataFrame entity, and it has edges to its type as well as to the columns it contains. Four and five are the column vertices; they have their attribute values, and they have edges to the DataFrame column type.

Let's consider this use case, in which there is a salary disbursement Spark process. Payroll details of all the employees are in one HDFS path, containing personal details like monthly salary, bank account number, name and more, and there is another HDFS path which contains the variable components like bonus and stock. The Spark process will pick up data from these HDFS paths, do its processing, compute the monthly salary for all the employees, and finally dump it into a Kafka topic from where the actual disbursement will be done. So, as we can see... not visible, right? Here I have declared the Spark DataFrame type. Is it visible? Okay. So let's assume that I have put the model in place for the Spark DataFrame type and the DataFrame column, then registered those entities using this piece of code and linked them together using this piece of code. Now we go to the Atlas UI, search for the process which we registered, and we can see that the lineage has been created: these are the source HDFS paths, this is the process which picks up data from the HDFS paths, this is the Spark DataFrame, and this is another process which puts the data into the Kafka topic. We can go down to inspect other attributes, like columns, qualifiedName and source as well. And this is the tags tab, which shows the list of tags attached to this particular entity, and along with this there is the audit, which shows the operations we have done on that entity. We can go ahead and attach any tag to this particular entity; let's say I added the 'expires on' tag to this entity and specified the value as the first of September. This information will be relayed to Apache Ranger, and if there is a policy of the kind in Apache Ranger that all expired datasets should not be allowed access, should not be exposed to the user or admin, then it will start applying on this particular entity as well.
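The registration code isn't reproduced from the slide, but registering a custom type against the Atlas REST API looks roughly like the sketch below. The endpoint, credentials and, in particular, the exact payload schema vary across Atlas versions, so treat this as an approximate illustration of the idea rather than the demo's actual code:

```python
import json
import requests

ATLAS = "http://localhost:21000/api/atlas"  # hypothetical Atlas server
AUTH = ("admin", "admin")                   # hypothetical credentials

# A spark_dataframe type inheriting from DataSet, mirroring the model in
# the talk: a required "source" attribute and an optional "destination".
typedef = {
    "enumTypes": [],
    "structTypes": [],
    "traitTypes": [],
    "classTypes": [{
        "typeName": "spark_dataframe",
        "superTypes": ["DataSet"],
        "attributeDefinitions": [
            {"name": "source", "dataTypeName": "string",
             "multiplicity": "required", "isComposite": False,
             "isUnique": False, "isIndexable": True},
            {"name": "destination", "dataTypeName": "string",
             "multiplicity": "optional", "isComposite": False,
             "isUnique": False, "isIndexable": True},
        ],
    }],
}

resp = requests.post(ATLAS + "/types", data=json.dumps(typedef), auth=AUTH,
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)
```

Entities would then be registered the same way against the entities endpoint, and the lineage shown in the demo comes from a Process entity whose inputs and outputs reference the HDFS paths, the DataFrame and the Kafka topic.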
Then there is the tags tab, on which we can see all the information categorized by the kinds of tags attached to entities. We can see that the 'expires on' tag has these particular entities, and we can navigate to PII, personally identifiable information, and see the list of entities attached to that particular tag. There is search capability in the Atlas UI: this is the basic search, which is based on the Solr index, so if we type in any keyword, all the entities which have that keyword in any of their attributes will be returned in the list of results. Along with this there is advanced search, in which we can specify more advanced predicates; we can search by the values of the attributes in a particular entity.

Okay, so the roadmap for Apache Atlas. As I mentioned, there is in-built support for components like Hive, Storm and Sqoop; in-built support for other popular components like Spark, HBase and NiFi is on the roadmap. Along with this, there is column-level lineage: as of now, for Hive there is table-level lineage available, but going forward we will be adding support for column-level lineage as well. Let's say there is a query of this kind: create table destination as select, with some operations on the columns, from the source table, so the value in X is derived from the source columns A and B. If we navigate to the entity page for column X, we will see the lineage diagram wherein A and B, the source columns, will be on the left-hand side, along with the operation in the middle, and then we derive the value for X. Apart from this, the import and export of metadata is on the roadmap: can all the metadata in the Atlas repository be exported in a well-known format so that it can be consumed by third-party metadata tools, and if there is some data in other third-party tools, can it be ingested into Atlas in a seamless manner? Apart from this, there is work on an open discovery framework which will provide the capability for data scientists to go to an entity page and compute basic metrics on the data in that entity. Let's say there is a Hive table; I want to know the data quality score for that table, or various basic statistical metrics like mean and median, to determine the data quality score.

These are the details of the project: the project website, the dev and user mailing lists, along with the list of JIRAs with the open issues. Atlas has been getting significant traction recently, mostly because of the critical area of enterprise governance, and various enterprises have been actively using Apache Atlas to govern their clusters. Apart from this, Atlas has a very rich code base in Java and Scala, so I would like to take this opportunity to invite potential developers to check out the project page and the code and see if the list of open items interests you; we would be happy to welcome you on board to start taking up issues and contributing to the project. I will close the presentation here and open it up for any questions from the audience.

So, is column-level lineage supported only for Hive on the roadmap, or supported for other hooks as well? Sorry, I didn't get the question. Column-level lineage you are talking about, right?
Yeah, it's only for Hive in the roadmap. It's only for Hive? For others, then, we have to extend it? I mean, it's so flexible that you can easily extend the model for any other component and start publishing data into Atlas. For Spark as well, you can put the model in place, and once the model is there you can register the entities using the REST API which is there in Atlas.

So, people who have administrative access for the cluster, like a Hadoop cluster, use Apache Ambari, let's say, right, for installing a component and other things. Why was this developed as a separate component and not within Apache Ambari, as a data governance kind of thing? Can you repeat the question? See, we have Apache Ambari for administrative tasks like installing HBase or adding a new region server or adding a new data node in the Hadoop cluster, right? So the people who are going to have access to Apache Ambari, the admins of Hadoop, they are the ones who are also going to apply data governance policies. Structurally, fundamentally, Atlas can be different from Apache Ambari, but from a user point of view, would it not have made sense had they been done within the same project? I mean, Atlas is an open source project, so you can spin it up independently from Ambari, start registering metadata, and connect it to Ranger to impose policies on top of all the data which is there. It doesn't depend on Ambari in any sense. No, it's not about dependence; I'm asking, would it have made sense to develop the features that you have developed within Atlas as part of Ambari itself? I don't see... I mean, Ambari is meant for Hadoop components, right? And Atlas was started as an independent effort, basically from an industry point of view, to govern the metadata. But it's different: it's extensible and flexible, so that any arbitrary component, whether it be a Hadoop component or not, can be modeled using Atlas and its metadata can be captured.

For the demo that you showed, the Spark salary processor, right? How can we provide details about the processor, like what kind of processing it is actually doing? Do we need to add those as the...? Yeah, so we will have to add attributes to the model which we defined. This is a very simple model which I used for the demo, so we can add all the attributes which we want to capture, let's say how much time did it take or what kind of CPU power did it consume; we can capture all those attributes and register them with Atlas. Can that be part of the hooks that are developed? When the Spark hook comes, will it be part of it?
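As a rough illustration of the answer above, extending the demo's model with runtime attributes might look like the following; the spark_process type and the attribute names are assumptions, and a real update would need to carry the complete existing type definition.

```python
# Hypothetical sketch: extending an assumed custom spark_process type with
# extra attributes such as duration and CPU usage. Names are illustrative.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

update = {
    "entityDefs": [
        {
            "name": "spark_process",  # assumed custom type from the demo
            "superTypes": ["Process"],
            "attributeDefs": [
                {"name": "durationMs", "typeName": "long", "isOptional": True},
                {"name": "cpuSeconds", "typeName": "double", "isOptional": True},
            ],
        }
    ]
}
# PUT on /types/typedefs updates an existing type definition in place.
requests.put(f"{ATLAS}/types/typedefs", json=update, auth=AUTH).raise_for_status()
```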
Yeah, so that depends on the developer. If you are putting the model into place, then you can put in triggers after your Spark job, so that when the job finishes, all this metadata will be captured and reported to Atlas. Okay, thanks, and I have another question. Yeah. See, at present, when we manage through Ranger, we allow even the end users to manage their policies; is the same thing possible in Atlas, can they define the governance and the tags? Yeah, so Atlas integrates with Ranger in the sense that you can define tags in Atlas, and the information about tag addition and deletion is communicated to Ranger; on those same tags we can define policies in Ranger, like access policies or data expiry policies. So Atlas closely integrates with Ranger in that sense. Atlas is not the policy management framework; it is meant to capture metadata and then provide actions on top of the metadata. Yeah, so in an enterprise cluster, the admin will have the power to define tags and attach tags to particular datasets; with proper user management, any developer can't just go and attach tags to a particular dataset.

So if anyone is trying to use this for an enterprise application, what would be the key considerations for scalability and high availability? Atlas is highly available, and it's scalable in the sense that it uses HBase as its metadata repository, so it's highly scalable: you can put in as much data as you want. It's been tested with enterprise Hadoop clusters; we did have some performance optimizations when issues were reported, but it seems to work well in enterprise Hadoop clusters as well. A lot of data? Yeah, we will catch up on it.

Yeah, so let's say if I were to develop a hook for Elasticsearch... can you hear me? Yeah. Let's say if I were to develop a hook for Elasticsearch using this, would I have to fork your codebase and add it, or are there interfaces that I can implement so that it will start working for Elasticsearch? How easy or difficult is it? Yeah, so you won't have to fiddle around with the Atlas codebase. You will have to have an understanding of the component which you are trying to model, what all attributes you want to capture. Once that understanding is in place, you can define the model, which will be a JSON, and then using the Atlas REST API you can register that model with Atlas; after that you can start capturing the metadata for that component.

More questions? From Hortonworks there is this Schema Registry, right? So is there any... Could you please stand up while asking the question? Is there any intersection in the features between that and this, and if yes, why is that a separate project? Schema Registry is actually quite different from Atlas; that is mostly for streaming applications, wherein we want to define beforehand what kind of messages will be produced and consumed from the topic. Atlas is more from a metadata point of view, not the actual data: what kind of events are happening within that component or across the components in the enterprise cluster. Does that answer your question?
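Following the answer above, a do-it-yourself "hook" for Elasticsearch could be as simple as a typedef plus a notification call after each index operation; this is a hedged sketch under those assumptions, with all names illustrative, not an official Atlas hook.

```python
# Hypothetical sketch of a custom Elasticsearch "hook": no changes to the
# Atlas codebase, just a one-time model plus a report after each index create.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

# One-time: model an Elasticsearch index as a DataSet subtype.
requests.post(f"{ATLAS}/types/typedefs", auth=AUTH, json={
    "entityDefs": [{
        "name": "elasticsearch_index",
        "superTypes": ["DataSet"],
        "attributeDefs": [
            {"name": "shards", "typeName": "int", "isOptional": True},
        ],
    }]
}).raise_for_status()

def report_index_created(index_name: str, cluster: str) -> None:
    """Call this from your ingestion code after creating an ES index."""
    requests.post(f"{ATLAS}/entity/bulk", auth=AUTH, json={
        "entities": [{
            "typeName": "elasticsearch_index",
            "attributes": {
                "qualifiedName": f"{index_name}@{cluster}",
                "name": index_name,
            },
        }]
    }).raise_for_status()

report_index_created("logs-2017-09", "demo")
```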
So if I just have to store the schema of whatever message is coming in, I can do that using this one as well, right? I don't need Schema Registry? No, actually Schema Registry is more suited to streaming applications. As I said, you define the schema and register it in a database, and then the producers use that; the Schema Registry engine makes sure that only schemas which are already registered are produced, that the data follows the registered schema. Atlas is, I mean, more about what is happening across the components or within a component. Okay, you mean per message, in case of Schema Registry... I'm sorry to cut you short, but you'll have to take the discussion offline. Thank you. Thank you very much.

Up next is the OTR, which would be conducted in the banquet hall, on the topic of securing data stored in the cloud for big data analysis. Before that, let's stretch out a little; we have Lochan with us, who will be showing us some good moves. Woohoo. Hello. Okay, most people seem to be leaving. Let's see who stays back. Looks like I'm going to be sitting by myself. Two people are standing up. What's happening with this? I can just mostly see white anyway. But thanks for your interest, thanks for staying back; I feel a little more encouraged. Yeah. Okay, so let's close our eyes again. Guys, are you in, are you out, you don't feel like it? Okay, eyes closed. Recalibrate. Focus on the breath.