Yes, so I'll start with a brief introduction. My name is Gaurav Vohra and I'm the founder of Jigsaw Academy. Jigsaw Academy is basically an online analytics training institute. That doesn't mean we train only on online analytics; it's an analytics training institute that primarily conducts its classes online. One of the USPs that we claim for ourselves is that we use a lot of business case studies as part of our training. The idea is to give everyone a flavor of what real business data looks like, how companies are actually using it, what kind of analysis they are doing and how you solve real business problems using data. There are many different case studies that we use as part of our course, and I thought I'd share one of them with you guys, just to give you a flavor of how analytics is used across various industries. This presentation is basically about risk-based pricing for car insurance in India. Before I get into that, just a simple definition of risk-based insurance pricing. Are you guys aware of what this is? Any guesses on what this means?
There are a lot of factors that affect the claims paid out for the person who takes the insurance. So based on his profile a premium is charged, and the people with a higher risk of claiming pay a higher premium.
So essentially you're right. The only difference is that in the Indian context, the insurance is given for the vehicle. It's not for the person. This is different from some other countries. For example, in the US, if you're taking insurance, it's for both the car and the driver. In India, the insurance is only for the automobile, only for the car. This presents some challenges, and we'll cover these in the last part of the presentation, because there are a lot of factors which are very natural indicators of the risk profile of a particular policy, but they pertain to the individual.
But in India, we don't have data on the individual. Any insurance that is written on an automobile is basically on the vehicle, so the factors that we can take into account are primarily those of the vehicle that is to be insured. Risk-based insurance pricing is basically differential pricing for different customers based on the likelihood of their having an accident and filing a claim. The idea is that not everyone has the same probability, the same risk, of getting into an accident. So how can you analyze historical data to find out if certain groups are more prone to accidents, more prone to filing claims, and charge a higher premium to them? This is something that happened about six or seven years ago. The Indian insurance sector was opened up. Before that, legal regulations actually prevented companies from doing any kind of risk-based pricing, so insurance in India was very straightforward. There's something called IDV, the insured's declared value, which is the value for which you insure your car or your bike or anything else. Insurance was basically a fixed percentage of that value. So regardless of where you live, what vehicle you're driving, any of those factors, there was a flat rate that was charged to everyone. About six or seven years ago, the insurance sector opened up and companies were allowed to do risk-based pricing. That's when they started thinking about what data they have and how they can leverage it to get more insights and do better pricing. So that's basically what this case study is about. And I'm going to keep it fairly simple. Even though we don't have hundreds or thousands of fields in this data, it can still get pretty complicated. There are a lot of variables that you can look at.
For the purpose of this presentation, I've focused on just two risk determinants, two factors that are the most common or the most predictive of risk. These are basically your location and the model of the car.
How did you know that these are the most predictive?
Well, I'll show you all the information that we have. When we started off, of course, we treated each variable with equal importance. We didn't have any biases of our own. But we've worked through this whole case study, and what I've taken now is a piece of it in order to keep it simple; otherwise it just becomes too technical. So I've taken the two variables that came out statistically most significant in the majority of the cases. This is just a snapshot, just a glimpse, of the kind of data that is there. Again, this data has been cleaned up and simplified. But essentially, you have a policy number, which is a unique identifier for each policy, and you have information about the vehicle. You'll notice that there is no information about the driver here. That's again because legal regulations prevent the insurance company from collecting any such information or, even if it is collected, from using it for any kind of risk-based pricing. So you will only see vehicle attributes here. You see the manufacturer, the model, and the year in which the car was manufactured. The field IDV, where you see all these numerical values, is basically the value for which the car is insured. Then you have the city, the state, the region, and the amount of claim. Normally your policy data and your claim data are two different data sets. What we've done is combine the two. So you can see that for some of the policies there are claim values. These are cars which got into an accident, a claim was filed, and this is the value that the company paid out. This is all historical data.
You can see the cubic capacity of the engine and whether there was a claim or not. And the last column here is the premium. This is the amount that the person is paying on an annual basis; this is the insurance premium. So this is basically the data that we had. Okay. And the typical modeling methodology, I'm sure all of you are... Yes?
Can you get the data from the insurance company?
Yes, yes. The data is from the insurance company that we work with. But obviously, for legal reasons, we've had to mask a lot of the data, and some of the insights that I will show you are not going to be accurate. We've swapped some values because those are things that we can't share. So the idea here is to give you a sense of the methodology that we built, rather than to focus on the exact numbers and statistics that you'll see here. We followed the standard methodology. There's a problem definition: a business problem was given to us, so how do you convert that into an analytical problem? This is obviously the first step for any kind of analytics project. You need to figure out what it is that you're modeling for, what it is that you're trying to predict. In my experience it is also one of the trickiest parts of the whole process. This is where projects either become really successful or die off. Then you start with the data exploration piece. Once you're familiar with the data, you prepare it for whatever modeling approach you're going to take. You build the actual model, validate the results of the model, implement the model, track the results, and of course keep refreshing the model. So this is your standard modeling methodology. What I'm going to do is focus on the first four steps of this process, and primarily on the first three. How did we define the problem? How did we convert the business problem into an analytical problem?
Then data exploration, data preparation, and the different kinds of modeling approaches that can be taken. So let's start with the first step of the process. For most kinds of analytics problems, this is usually the most interesting step. So, problem definition. This is basically the problem that was given to us by the business. The premium is what the insurance company charges for insuring the car, so this is what you're paying to the insurance company. It's X percent of IDV. So the premium is a fixed percentage of the value that you've declared for your car, and this X is the rate that is assigned by the company. Essentially what we have to do is determine the optimal level of X. Till now the company has been using a flat rate X that they've been charging for all models, all locations and everything. Now what they're trying to do is come up with different values of X for different models based on the risk profile. This is essentially what the business told us. Now there are a couple of questions here. If I were to ask you, what is the revenue stream for the insurance company? What is it that they are making money from?
Insurance policies?
From the premiums, exactly. If we just simplify things for the sake of understanding, the revenue for the insurance company is basically the premiums that are coming in. And what is the cost? The cost is the claims. There is of course an administrative cost, a marketing cost and all of that, but for the purpose of this analysis we don't need to get into that. We have a very simplified situation where the premium is the revenue and the claim is the cost. So essentially the profit is the sum of all the premiums that the company is receiving minus the claims that they are paying out. And if we just rearrange this equation, the premium essentially comes out to be the claim that is paid out plus the profit that you want to keep.
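To make that last relationship concrete, here is a minimal sketch of it in Python. The function names, the 2.4% expected claim, the 0.4 point margin and the IDV of 300,000 rupees are all illustrative assumptions chosen for the example, not the company's actual figures.

```python
def premium_rate(expected_claim_pct: float, profit_margin_pct: float) -> float:
    """Rate X, as a percentage of IDV: expected claim plus the margin to keep."""
    return expected_claim_pct + profit_margin_pct


def annual_premium(idv: float, rate_pct: float) -> float:
    """Premium in rupees for a vehicle insured at `idv`, charged at `rate_pct` percent."""
    return idv * rate_pct / 100.0


# Hypothetical inputs: expected claim of 2.4% of IDV, margin of 0.4 points.
rate = premium_rate(expected_claim_pct=2.4, profit_margin_pct=0.4)
premium = annual_premium(idv=300_000, rate_pct=rate)
print(f"rate = {rate:.2f}% of IDV, premium = {premium:.0f} rupees")
```

The only part that needs modeling is the expected claim; the margin is a business choice layered on top.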
So essentially, what is it that the company is trying to do? They are trying to predict. In terms of the analytical approach, what we are doing is predicting the claim as a percentage of IDV. So if a car is insured for 100, what percentage of that is the expected claim, what you expect to pay out? Once you have the claim as a percentage of IDV, there is a delta that we can add to it, which is a business decision on what amount of profit they want to keep, and that is how we decide on X. So given the problem that the business had, this is how we converted it into an analytical problem. The goal of our exercise now is to predict claim as a percentage of IDV. Next, some high-level data exploration. With any new kind of data, the first step is to summarize the data in order to understand it. So this is the table. We had 41,096 policies; this is the data that we have. Of these policies, 7,702 had actual claims. The total premium received by the company on these policies is 435 million rupees, 43.5 crores. The premium as a percentage of IDV at the macro level, the aggregate level, is about 2.78%. So this is what the company is receiving. Total claims paid out are 375 million, and claim as a percentage of IDV is 2.4%. So essentially they are charging 2.78% of the IDV as premium and they are paying out 2.4% in claims. This just gives us a sense of what the overall data looks like. This is very important, because you need to have a very good sense of what the data looks like in order to find out what is unusual, what is different from the rest. Again, this is a simplified version of the data. We have 12 models and we have 16 cities. So these are the 12 car models that are there across these 41,000 policies, and these are the 16 cities. Then we have done a little more exploration.
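The headline figures quoted above can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers from the talk; the variable names are mine, and "loss ratio" is the standard insurance term for claims divided by premiums, which the talk does not use by name.

```python
policies = 41_096
policies_with_claims = 7_702
total_premium = 435_000_000   # rupees, as quoted
total_claims = 375_000_000    # rupees, as quoted
premium_pct_of_idv = 2.78     # percent, as quoted

# Share of policies that had a claim.
claim_frequency = policies_with_claims / policies

# Claims paid out per rupee of premium collected.
loss_ratio = total_claims / total_premium

# Total insured value implied by the premium and the quoted premium rate.
implied_total_idv = total_premium / (premium_pct_of_idv / 100)

# Claims as a percentage of that implied total IDV.
claim_pct_of_idv = 100 * total_claims / implied_total_idv

print(f"claim frequency  : {claim_frequency:.1%}")
print(f"loss ratio       : {loss_ratio:.1%}")
print(f"claim as % of IDV: {claim_pct_of_idv:.2f}")
```

The derived claim percentage rounds to the 2.4% quoted in the talk, so the quoted aggregates are consistent with one another.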
We are trying to figure out the frequency of each of these models. Maruti Wagon R is 18% of all the policies, the most common. Mitsubishi Cedia is the lowest one. We do the same with the cities, just to get a sense of where the data is coming from and what values it has. The next step is to rank all the models in terms of claim as a percentage of IDV. In the data we have the claim and we have the IDV. What we have done is aggregate them and compute, for each model, the average claim as a percentage of IDV. We can see some very interesting and very apparent trends here. Hyundai Santro is the most profitable model for the company at an overall, macro level. 1.74% is what they are paying out, and we saw they are charging about 2.8% on average. So just looking at this table, you can immediately see that the models at the top are highly profitable for the company. The models at the bottom are not profitable. The bottom three, in fact, are loss-making for the company. We have also classified them into low, medium and high. We have done the same with the cities. Again, you can see some trends. There are some cities which are highly profitable, and there are some cities which are actually loss-making. If you just look at the names, you can see that some of the smaller cities are very profitable, while the larger cities tend to be unprofitable. Intuitively that makes sense, because the larger the city, the more the traffic, and the more likely you are to get into an accident. So if you look at the previous table and at this one, you can immediately see that the company should probably charge a higher price for SUVs. You can see Tata Safari and Scorpio right there at the bottom. So SUVs tend to be more risky than the rest of the population. Similarly, bigger cities tend to be more accident-prone than the rest of the population. What we have done is take a very simplistic approach here.
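The ranking step just described can be sketched in a few lines of Python. The records below are made-up toy values, not the insurer's data, and the rule for cutting the ranked list into low, medium and high (equal thirds by rank) is my assumption; the talk does not say how the buckets were drawn.

```python
from collections import defaultdict

# Hypothetical toy records: (model, idv, claim_amount). Not the real data.
records = [
    ("Santro", 200_000, 3_000),
    ("Santro", 220_000, 4_000),
    ("Wagon R", 250_000, 6_000),
    ("Wagon R", 240_000, 5_500),
    ("Safari", 600_000, 24_000),
    ("Safari", 650_000, 20_000),
]


def claim_pct_by_group(rows):
    """Aggregate claim as a percentage of IDV for each group key."""
    idv_sum = defaultdict(float)
    claim_sum = defaultdict(float)
    for key, idv, claim in rows:
        idv_sum[key] += idv
        claim_sum[key] += claim
    return {key: 100 * claim_sum[key] / idv_sum[key] for key in idv_sum}


def bucket_by_rank(pct_by_group, labels=("low", "medium", "high")):
    """Rank groups by claim percentage and cut the ranking into equal parts."""
    ranked = sorted(pct_by_group, key=pct_by_group.get)
    n = len(ranked)
    return {
        key: labels[min(i * len(labels) // n, len(labels) - 1)]
        for i, key in enumerate(ranked)
    }


model_pct = claim_pct_by_group(records)
model_bucket = bucket_by_rank(model_pct)
for model in sorted(model_pct, key=model_pct.get):
    print(f"{model:8s} {model_pct[model]:.2f}%  {model_bucket[model]}")
```

The same two functions work for cities by passing the city as the group key instead of the model.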
We have clubbed all the models into high, medium and low. We have clubbed all the cities into high, medium and low, and we have made a 3x3 matrix: high-high, high-medium, high-low and so on, those 9 combinations. You can clearly see the difference here in terms of what the company is paying out as a percentage of IDV. So for the company it now makes sense to charge a higher rate for any Tata Indica or Honda Accord or Tata Safari or Scorpio that comes from these cities. They can have differential pricing based on the model that you have and the city that you are coming from, and they can have different prices. So if there is a Ford Figo that comes from Jalandhar, you can charge a much lower rate, because you know that the likelihood of a claim is that much lower. So this is the first step that we have done. Now there are obvious problems with this approach. Would anyone like to point out the problems with this approach?
People might register the vehicle in one place and drive it in a different one.
Okay, yeah. But with just this approach, as a model? Right. No, see, when we did the actual model, the age of the vehicle came out as a significant factor and it was put in. I have not taken it here because I didn't want to show a 3x3x3 matrix. It's just easier to show it with two variables. Assuming that this is all the information, there's a very obvious problem with this model that you can see here. It's an aggregated model. Guwahati and Gurgaon are high-risk cities, and Tata Indica and Honda Accord are high-risk models. But that doesn't mean that the combination of Guwahati and Tata Indica is necessarily a high-risk combination, because we aggregated it here. What we could have done is take each of these cities individually and each of these models individually, build a 16 by 12 grid of 192 cells, and then pick out each value independently.
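As a rough sketch of the cross-tabulation described above, the following Python builds the model-bucket by city-bucket grid of claim as a percentage of IDV. The bucket assignments and the policy rows are hypothetical values made up for the example, and only two of the three buckets appear so that the output stays small.

```python
from collections import defaultdict

# Hypothetical bucket assignments and policies; illustrative values only.
model_bucket = {"Figo": "low", "Safari": "high"}
city_bucket = {"Jalandhar": "low", "Delhi": "high"}

policies = [
    # (model, city, idv, claim_amount)
    ("Figo", "Jalandhar", 400_000, 5_000),
    ("Figo", "Delhi", 400_000, 9_000),
    ("Safari", "Jalandhar", 600_000, 15_000),
    ("Safari", "Delhi", 600_000, 21_000),
    ("Safari", "Delhi", 600_000, 18_000),
]


def risk_grid(rows, model_bucket, city_bucket):
    """Claim as a percentage of IDV for each (model bucket, city bucket) cell."""
    idv_sum = defaultdict(float)
    claim_sum = defaultdict(float)
    for model, city, idv, claim in rows:
        cell = (model_bucket[model], city_bucket[city])
        idv_sum[cell] += idv
        claim_sum[cell] += claim
    return {cell: 100 * claim_sum[cell] / idv_sum[cell] for cell in idv_sum}


grid = risk_grid(policies, model_bucket, city_bucket)
for (m, c), pct in sorted(grid.items(), key=lambda item: item[1]):
    print(f"model={m:4s} city={c:4s} claim = {pct:.2f}% of IDV")
```

Keying the grid on the raw model and city instead of their buckets gives the 16 by 12 version mentioned in the talk, at the cost of much thinner cells.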
So we could have done that, because we have two variables and we've got only 16 and 12 values in each of them. If you have 20 different variables, and some of them are continuous and some of them are discrete, you really can't use this cross-tabulation approach. So this is essentially a BI approach, where you're just slicing and dicing and looking at the frequency in each of those cells. This next one is a slight improvement on the same approach, where you're using a technique called decision trees. It is essentially a cross-tab approach, but what it does is automate the way in which you select your splits. If you have 10 different variables and 20 different values, which variable to choose and what value to split at to show these kinds of splits, the decision tree algorithm decides that in an automated way. So here, for the same data, instead of 9 cells we've now ended up with 8 different cells: 1, 2, 3, 4, 5, 6, 7 and 8. These end nodes of the tree are basically your population. Your entire population is divided into these 8 groups. And you can see that the highest rate here is 3.65%. When we did the simple cross-tab approach, the highest rate that we had was about 3.5%. 3.2%, right? I'll just take two minutes. What you can also see, if you look here, is that Guwahati, Gurgaon, Bangalore, Delhi and Chennai are the cities that were classified as high-risk cities. When we move to the tree approach, Chennai has moved out of that bucket. It has come down into this bucket. So there have been slight adjustments, but overall, with this approach, you can see that the differentiation is from 3.65 to 1.4, as opposed to the 3.2 to 1.4 that you saw in the cross-tab. So the differentiation is a lot better using this approach. This is essentially doing things with a little more rigor, taking it one step further. You build a decision tree to do the whole thing. Now, you could use regression analysis for the same. You could use complex neural networks.
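To illustrate what "automating the splits" means, here is a minimal sketch of how a tree could pick its first split on categorical variables: for each variable it orders the categories by average claim rate, tries each cut point, and keeps the split with the lowest squared error. This is a generic CART-style criterion used as an assumption; the talk does not say which tree algorithm or software was actually used, and the rows below are made-up values.

```python
# Hypothetical rows: (features, claim as % of IDV). Illustrative only.
rows = [
    ({"city": "Delhi", "model": "Safari"}, 3.6),
    ({"city": "Delhi", "model": "Figo"}, 3.0),
    ({"city": "Gurgaon", "model": "Safari"}, 3.4),
    ({"city": "Gurgaon", "model": "Figo"}, 3.2),
    ({"city": "Jalandhar", "model": "Safari"}, 1.8),
    ({"city": "Jalandhar", "model": "Figo"}, 1.4),
]


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, features):
    """Return (score, feature, low_side_categories) for the best single split."""
    best = None
    for feat in features:
        by_cat = {}
        for x, y in rows:
            by_cat.setdefault(x[feat], []).append(y)
        # Order categories by mean target so only contiguous cuts need checking.
        ordered = sorted(by_cat, key=lambda c: sum(by_cat[c]) / len(by_cat[c]))
        for cut in range(1, len(ordered)):
            low_side = set(ordered[:cut])
            left = [y for x, y in rows if x[feat] in low_side]
            right = [y for x, y in rows if x[feat] not in low_side]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, feat, low_side)
    return best


score, feature, low_risk_side = best_split(rows, ["city", "model"])
print(f"split on {feature}: {sorted(low_risk_side)} vs the rest (SSE {score:.2f})")
```

A full tree repeats this search inside each resulting group until the groups are too small or too similar to split further; the end nodes are then the pricing cells.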
You could use a whole lot of different approaches for this kind of modeling. But here the modeling approach really depended on the way the client was going to implement it. If we built a regression model or a neural network model, we would essentially need a computer. We would need some kind of scoring engine to come up with any decision. But in the insurance industry, if you guys have seen, what typically happens is that the insurance guy calls you. The sales rep will call you. He'll come to your house, he'll get your details, he'll give you a quote, and the insurance policy is priced that way. So what the insurance company wanted was a very simple structure, basically a sheet of paper that they can hand over to their associates and say, do the pricing based on this. You can't build a regression model and give the results on a piece of paper. You can't build a neural network model and give the results on paper. So this decision tree approach comes in useful when you want to do any kind of modeling where the whole approach has to be very transparent, where the whole modeling methodology has to be very clear and apparent to everyone, and where the results have to be easily implementable. So that essentially was what we did here. And just as a final point: the first approach was a cross-tabulation approach, and the second approach is a decision tree approach.
So I've seen those paper-based charts that these agents have. But does that mean that nobody in the industry is lobbying towards a model where you use personal information about the driver to penalize bad drivers, and not have good drivers subsidize the bad drivers?
No, there have been talks going on for a long time.
Because if the industry is trying to go in that direction, then they will also have to move away from a paper chart-based model and give an Android phone or something to the agent, so that they can actually use the driver ID.
Yeah, actually the even bigger question in this case is the data. How do you collect that data? How do you store that data? Right now these companies have not done any of that. Even though using it was against legal regulations, if the companies had just been collecting this information, they would have a lot of historical data, so that when the regulations change we would have the data to analyze, build models on, and predict what kind of pricing we can do. In this case there is no data. Of the insurance companies, at least the ones that I've interacted with, none has stored any kind of personal information. So they'll have to start now and build it up over the next three to four years before you can build any kind of good model on it.
In India, getting a driving license number or a PAN card number, not making both of them mandatory but asking the customer to give either of them, would be a possible solution. People are giving those two numbers everywhere right now.
Now with the UID also, I think a lot of the personal information will be easier to capture.
So I've heard of this at least in the U.S., for cars. They have the concept of VIN numbers.
Right. Vehicle identification numbers.
So usually that lists the accident history of the car in question. Say you're buying a car; you don't know whether you're the third owner or the fourth or whatever. But by getting the VIN number, which is a unique number for the car itself, you can find out where it was manufactured and what accidents it has gone through, without disclosing personal information about whoever has driven it. So in India, is that something that you guys factor in? Do you have access to that kind of information, or is that a blind spot?
Okay, so what you're talking about is essentially a bureau of information, where all these insurance companies come together and share their data, so that even if an automobile has moved across insurance providers, the aggregate data is still present. I think there are moves on for this. I'm not sure what the status is right now. As far as I know, there's no information being shared across insurance companies as of now. Banks, as you all must be aware, have started doing this. So if you take an HDFC credit card and you default on it, and then you apply for an HSBC credit card, HSBC will get the information from HDFC and you won't get a credit card. Those things are happening in the banking sector, and I think they will come to insurance also pretty soon.
In the insurance sector, there are already the engine number and chassis number for the vehicle. Those can be used without introducing a new number like a VIN number.
No, you have the numbers, the identifying information. You just need all the insurance providers to come together in a bureau format where all the information is aggregated and shared. Right now, the information can just be passed from the old insurance company to the new one.
Okay guys, so, just the last point. There are different factors that we could have considered in this model: type of vehicle, vintage of vehicle, and more granular geographic information, maybe at the district level, where we'd get a better read. If we could get driver-related information, then age of driver, driving history, single or married, with or without kids. These are very strong predictors in the U.S. market, because they have this information. If we start using personal information, these are other things that could really improve the models, I feel. Okay, thank you guys.