Hi, good evening everyone. I'm Kywin, an intern in the Data and AI team at Singapore Power. For the past three months of my internship I have been part of the operations efficiency work within the Data and AI team. The project I have been working on is residential move-in detection using utilities consumption data and machine learning. It is part of a larger project to improve the efficiency of metering-to-billing operations for SP Services.

Here is a brief introduction to the project. Electricity meter readings are taken by meter readers and uploaded to the SP server once every two months. Once uploaded, each reading is checked against a system of rules. If a meter reading is too high, it might be caused by a faulty meter, and technicians then need to be sent to the premises for an on-site investigation. But there are many cases where the high reading is actually due to someone moving into the premises, so the meter is not faulty. In those cases the trip is wasted, because it incurs manpower and other costs. My project is to build a predictive machine learning model to detect these new move-in events. By integrating this model into the daily operational flow of meter irregularity investigations, the operations team can identify whether a high reading is due to a new move-in, which reduces the false positives.

These are the technologies and tools that I used. I wrote my project in Python using Jupyter Notebook, and I used scikit-learn and XGBoost to create, train, evaluate, and compare my models to see which one performed better. These are the steps in my project execution: first data collection and processing, next feature engineering, then model training, model evaluation, and lastly model comparison.

Here is a walkthrough of the code in my Jupyter Notebook. The first phase is data collection and processing. In this phase I collect the notification, account, and billing data, and then process it so that features can be engineered more easily for each notification. Let me tell you more about the download file. The download file contains the notifications that are triggered by the system of rules, and these are its fields. The device column refers to the meter associated with the notification that was triggered. The diff column refers to the rise in a specific utility, whether electricity, water, or gas, that caused the notification to be triggered. The contract account is the account number associated with the meter. The date issued column is the date on which the notification was triggered, and the cost code description is the reason why it was triggered. The upload file contains the results after the meters associated with the notifications have been investigated. The control reading column shown here is the reading taken by the technicians after they investigated the meter associated with the notification. With these two files, I then merge them together on the notification, so that each record shows both why the notification was triggered and the result of the investigation.
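Here is a simplified sketch of that first merge step in pandas; the file paths and column names are placeholders, not the actual schema of the download and upload files.

```python
import pandas as pd

# Download file: notifications triggered by the system of rules.
# Upload file: investigation results, including the control reading and written remarks.
# File paths and column names below are illustrative placeholders.
download_df = pd.read_csv("download_file.csv")  # notification, device, diff, contract_account, date_issued, cost_code_description
upload_df = pd.read_csv("upload_file.csv")      # notification, control_reading, written_remarks

# Merge on the notification number so each row carries both the trigger
# reason and the on-site investigation result.
merged_df = download_df.merge(upload_df, on="notification", how="inner")
print(merged_df.head())
```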
After merging these two files, I extracted the high consumption notifications using the cost code description, which defines a notification as being triggered by high consumption. I also extracted based on the diff column, keeping the value 01, because 01 means the notification was triggered by a rise in electricity. After extracting the notifications based on these two conditions, I joined them to the account move-in information, and from there I derived how long the account had been open, from the date it was first opened to the date the notification was triggered. After that, I obtained the six consecutive previous months counting back from the day the notification was triggered. Then I collected the billing data, which contains the amount of electricity, water, and gas consumed over a specific period by a particular account. I also joined the notifications to the postcode of the premises where the account was opened, as well as to utility statistical information. I then further filtered the notifications by specifying the account type to be DOM, because my project is on detecting residential move-in events.

The last part of this phase is label generation, where I generate a label specifying whether the notification was triggered by a new move-in. The rule is: if the written remarks column in the dataset contains the words "newly moved in" or "new moving", I return a one, meaning the notification was triggered by a new move-in. If the written remarks do not contain these words, I return a zero. Here are some of the notifications and their written remarks. As you can see, the remarks that do not contain those words are labelled zero, and the remarks mentioning a new move-in are labelled one.

In the second phase, feature engineering, I derive features from the processed dataset for my machine learning model. Let me tell you about a few of the features I generated. The first is the ratio of electricity usage for the latest month. I feel this feature is useful because it allows the model to learn the relationship between electricity usage and the water and gas usage; similar usage patterns among electricity, water, and gas suggest a high possibility that the notification was triggered by a new move-in. Here are the consumption ratios for the various notifications. Another feature I engineered is the number of notifications triggered for the same postcode. I feel this feature is useful because a high number of notifications for the same postcode within a specific period could imply that the housing in that area is new, which leads to a high number of new move-in cases. Therefore, a notification whose postcode had many notifications triggered in the past 10, 20, and 30 days is more likely to have been triggered by a new move-in. Here are some of the notifications with their written remarks, details, and the number of notifications for the past 10 days.
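Before moving on to model training, here is a simplified sketch of the filtering, label generation, and ratio feature described above; the column names, string values, and the exact form of the ratio are placeholders based on my description, not the actual notebook code.

```python
import pandas as pd

# Keep only high-consumption electricity notifications for residential (DOM) accounts.
# Column names and string values are illustrative placeholders.
mask = (
    merged_df["cost_code_description"].str.contains("high consumption", case=False, na=False)
    & (merged_df["diff"] == "01")             # 01 = triggered by a rise in electricity
    & (merged_df["account_type"] == "DOM")    # DOM = residential account type
)
dom_df = merged_df[mask].copy()

# Label generation: 1 if the written remarks mention a new move-in, else 0.
def label_new_move_in(remarks):
    if pd.isna(remarks):
        return 0
    remarks = str(remarks).lower()
    return int("newly moved in" in remarks or "new moving" in remarks)

dom_df["new_move_in"] = dom_df["written_remarks"].apply(label_new_move_in)

# One possible form of the ratio feature: electricity usage for the latest month
# relative to the total electricity, water, and gas usage for that month.
dom_df["elec_ratio_latest_month"] = dom_df["elec_latest_month"] / (
    dom_df["elec_latest_month"] + dom_df["water_latest_month"] + dom_df["gas_latest_month"]
)
```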
The next phase of my project is model training. In this phase, I train two machine learning models: logistic regression and extreme gradient boosting, which is XGBoost. Before actually training the models, I prepared my data so that it could be used for model training. These are the steps that I took. First, I removed the notifications without a label value, because without a label the model is not able to learn. I also filtered out all the rows where there was no consumption for the latest month, because consumption in the latest month is needed to tell whether the notification was triggered by a new move-in. I then defined the indexes to be used for the training and test sets, and removed all the other fields that are not features, retaining only the features and the label for each notification. I replaced the NaN and infinity values, because a dataset containing these values cannot be fitted into the logistic regression model. After making sure the data could be used for training, I split the dataset into a training set and a test set.

The first model that I trained is the logistic regression model. I trained it by performing cross-validation and obtaining the most optimal hyperparameter value for the model; this is the most optimal hyperparameter value and its accuracy. The second model I trained is XGBoost. Again, I performed cross-validation, and here are the parameters that I tested. After performing the cross-validation, I obtained the most optimal hyperparameter values, which are shown here.

The final phase of the project is model evaluation. In this phase, I use the trained models to make predictions on the test set, and here is the performance of both models. After evaluating both models, I proceeded to compare them. For the logistic regression model, I changed the decision boundary for the new move-in classification so that the logistic regression and XGBoost models would have the same accuracy in predicting new move-ins.

In conclusion, I used the logistic regression model as the baseline model for training and evaluation, because my project needs a binary classification algorithm. However, the logistic regression model I created was not able to predict non-move-in cases very accurately. This can be seen from its confusion matrix, where it correctly predicted only 75 out of 260 non-move-in cases. Because of that, I explored another model to reduce the number of non-move-in cases being misclassified. The XGBoost model reduced this number substantially: it misclassified only 14. Because of this large reduction in misclassifications, I decided to choose XGBoost over logistic regression.
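To make the training and comparison steps concrete, here is a simplified sketch of how they could look with scikit-learn and XGBoost; the feature list, parameter grids, test split, and threshold value are illustrative placeholders, not the exact ones I used.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

# Illustrative feature list; the real notebook would also include the other
# engineered features, such as the postcode notification counts.
feature_columns = ["elec_ratio_latest_month"]

# Replace NaN and infinity values, since logistic regression cannot be fitted with them.
X = dom_df[feature_columns].replace([np.inf, -np.inf], np.nan).fillna(0).to_numpy()
y = dom_df["new_move_in"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression: cross-validate to find the best regularisation strength.
lr_search = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
lr_search.fit(X_train, y_train)

# XGBoost: cross-validate over a small grid of tree parameters.
xgb_search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                          param_grid={"max_depth": [3, 5], "n_estimators": [100, 200]}, cv=5)
xgb_search.fit(X_train, y_train)

# Evaluate both models on the test set and compare their confusion matrices.
for name, model in [("logistic regression", lr_search.best_estimator_),
                    ("xgboost", xgb_search.best_estimator_)]:
    print(name, confusion_matrix(y_test, model.predict(X_test)), sep="\n")

# Shift the logistic regression decision boundary so both models catch a
# similar share of true move-in cases before comparing them.
threshold = 0.3  # illustrative value
lr_probs = lr_search.best_estimator_.predict_proba(X_test)[:, 1]
lr_adjusted = (lr_probs >= threshold).astype(int)
print(confusion_matrix(y_test, lr_adjusted))
```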
By integrating this predictive model into the operational flow, it can bring about the following values. Firstly, a reduction in the cost of operations. Meter irregularity investigations are expensive due to high manpower and other costs, and if a high meter reading is identified as being due to a new move-in, there is no need to send a technician on site. Secondly, a more efficient use of resources. By reducing the number of false positives, the operations team will have more manpower and time to focus on the real faulty-meter and meter-tampering cases, and will be more efficient in dealing with the true cases. Thirdly, better customer service. An on-site investigation can sometimes require the customer's presence, especially if the meters are in a private area. This is inconvenient for customers, who have to schedule a time when they are available to be around, so reducing the number of on-site investigations improves overall customer service. Lastly, a reduction in safety risk. Reducing the need for on-site investigations improves the overall safety of the technicians. Safety is an important priority for SP, so this is also an important benefit for the company.

However, I feel the following improvements can be made to further improve the model that I built. Firstly, I should retrieve the actual electricity consumption instead of the estimated one. During the execution of my project, as a result of including the estimated consumption data in the feature engineering phase, I realised that for certain notifications there was actually no significant rise in electricity consumption for the months nearest to the notification date, and this caused downstream problems because I was not able to derive useful features from those values. The second improvement is to use the document date closest to the notification date, and to use the billing period associated with that date as the latest month; in the data collection and processing phase, I made a mistake when transforming the newly merged dataset, where I wanted to obtain the six consecutive previous months so that I could generate further features on the utilities consumption for those months. Last but not least, for each model I should train the model obtained from cross-validation on the entire training set; in the model training phase, I missed out this step and jumped straight into model evaluation, and these issues have affected the accuracy of the model training and evaluation process. In conclusion, I feel that with these improvements implemented, my model can become more accurate in predicting new move-in cases. Thank you.

How long did you spend working on this at SP? On this project specifically, I spent 12 weeks working on it. Just 12 weeks? Yeah. So quite a big achievement in 12 weeks. And before you came to SP, you had never actually done this before, or used Jupyter Notebook or anything? Have you used those before the internship? No. No? Yeah. So you learned all of this during the 12 weeks of the internship? Yeah. All right, cool. Yeah, it's quite an achievement, isn't it, guys? A round of applause... Thank you. Thank you.