in detail — you can see that in the documentation. The libraries I'm using — pandas, SciPy, Keras, TensorFlow — all have their documentation online, so I won't go into their details. Instead I'll explain how I approached this. I have an interest in machine learning and artificial intelligence, and in the middle of this year I was trying to move house. While I was looking, I wondered: what would be a reasonable rent in this location? I got interested in the problem and tried to solve it, so I'll discuss my journey into machine learning and how I did things. So let's start from the beginning, with a brief introduction. I'm Vikrant Rathore. I'm working on my own startup, Comercio, which helps companies manage their product catalogs — if you want to manage millions of products, I'm building a product for that. That's it about me; you can check online if you're interested in more details. So, moving ahead: how many of you think machine learning or artificial intelligence is a panacea for all problems? Nobody, right? Because if someone came to you and said they had a medicine that can cure every disease in this world, what would you think? That they're crazy, correct? But for every problem, we assume artificial intelligence and machine learning will be able to solve it. Actually, this field is not new — it has been around since the 50s and 60s. It's practical today because we now have large computational power and large storage at a very cheap cost, so the mathematical concepts developed back then can finally be applied. Machine learning, for me, is very simple.
Machine learning is making a program learn a pattern from information by itself. Under the hood it uses linear algebra, computational statistics, and probability theory, but that's about it. Of course, as the saying goes, any sufficiently advanced technology looks like magic — and it's the same with this one. There are three kinds of machine learning that are very popular right now: supervised, unsupervised, and reinforcement learning. How many of you know all three? OK, very few, so I'll give a brief overview. In supervised learning, we have training data — we need a large amount of it — we label it with the correct answers, we train the program, and based on that the program predicts on unseen information. Unsupervised learning is harder, and still an evolving field, because you want to find patterns without training or labeled data. The third one is reinforcement learning. Reinforcement learning is learning like a child, popularized by Google with programs that play computer games based on a reward function: the better you play, the more reward you get, and the program optimizes for that. That's reinforcement learning in simple language. It's used in stock exchanges and other places now, but it's still an evolving field. Now, about the components of machine learning — these are the basic ones. One thing you should remember is that the program that learns by itself is still defined by humans. If you think machine learning and artificial intelligence can happen without humans, it's not possible — it still requires humans. Maybe in future it will be different, I don't know.
But today, whatever we develop in machine learning, the target functions, the labeled data, and the mathematical concepts are still defined by humans, so there is a very large human component involved. You must have seen that Google developed AlphaGo to play against top Go players. But they couldn't apply the same program to chess, it took around 30 million recorded moves to learn the game, and after beating Lee Sedol they needed about eight months to train it again for the next match. It requires a lot of work; the human component still doesn't go away. So, how many of you are familiar with Python? Why did I use Python? I picked it up long, long ago. I'll stop the presentation mode now and switch over: I'll start the notebook and work with the code directly. Can all of you see it? OK. This is a Jupyter notebook — a Python workspace that helps me develop any machine learning program. Why Python? Simply this: it fits my brain. "Beautiful is better than ugly" — the Zen of Python. That's the main reason I picked Python long ago, back in 2004, when the choice to be made was between Ruby and Python. At that time Ruby and Ruby on Rails were the more popular ones; now it's the other way around. You must have read my talk description: house rental prediction using URA data — Urban Redevelopment Authority data. In Singapore, the Urban Redevelopment Authority publishes past transaction data. So, "predict" — is that enough to develop something in machine learning?
Is my problem statement enough to start exploring and build something? When you read it, what do you think I'm going to predict? If I say "prediction", you might say: given a location, find the rental, correct? But there are a lot of variables in it, so you go one level deeper: what exactly do I want to predict? Some would say prediction means: what will my property price be three months or a year later? Solving that problem is a completely different line of thinking. I defined my problem more precisely. What I wanted was: given a square-foot area and a district, find the approximate rental for a house. That's the problem I defined for myself; you could go into a lot of other directions. Given this definition, a computer program should be able to tell me what my rental will be. And I go top-down instead of bottom-up: I developed a web service that predicts rentals based on this information. It's a simple Flask app — it receives a web request and gives you the prediction. What it requires is very simple. You can see here: size 850, bedrooms 3, district 03 — this is the query I gave, and it gives me a rent of 4,076 Singapore dollars. And this is coming from the program itself — it's not rule-based, it's using machine learning behind the scenes. This version of the service uses traditional supervised machine learning, solving a linear regression problem: since I want to predict the house rental, I want to calculate a quantity, so it should be a regression problem, not a classification problem.
So I used linear regression. The other version uses TensorFlow with a neural network — a single-layer network — and you can see the difference: one predicts 4,076, the other predicts 4,593 for the same house. I will come back to that. Now, this is the service I developed — how did I develop it? The first step is defining the problem, which I've done: I want to predict rentals. What factors impact rentals? Can anybody tell me? Location and size — usually these are the two most common factors affecting rental prices. One thing you need to remember: whatever input you take for prediction must actually be available. So I went looking for the data, and I looked at the Urban Redevelopment Authority data Singapore provides. Let me show you the website — is the internet working here? I want to show you what the data really looks like and how I worked with it to get to the point where there is a simple endpoint. So: property market, private residential property transactions... here we go — rental contracts for private residential properties. I select it, then search by property type — non-landed housing development — select a district, and search. This is the data I get from URA: building name, street name, postal district, property type, number of bedrooms, monthly rent, floor area, and lease commencement date. So the plan is: develop a program, train it on the figures from this URA data, and then use the trained program to predict. That is what I was doing.
Now, I had already defined my problem: I give just the postal district, the size, and maybe the number of bedrooms. Why bedrooms? Because sometimes, for similar square footage, a three-bedroom unit charges more than a two-bedroom one — that's practical experience, so I think bedrooms are related to the price. Now, when you start on problem-solving, you don't need a completely clear-cut idea of whether there is a correlation between a feature and the final prediction — you can discover that along the way as you develop. You don't have to be very precise at the beginning, but there should be at least some plausible correlation so you don't go off the road; otherwise it will take you a very long time. All right, so I got the data. Now, how many of you have worked on linear regression problems before? If you have, what kind of data do linear regression implementations mainly accept? You provide a NumPy array, right — and what kind of values go in it? I'll tell you: when you want to solve a linear regression problem, the machine learning algorithms do not accept categorical data. Categorical data is something like a street name, or the district — these are labels, not numerical values. All the practical examples you see on the internet have very nicely prepared numerical data, so the results come out quite nicely. In real life you don't have that. For me, even the square footage comes as a range, not an exact figure. Obviously, I could subscribe to URA's full data feed, get 20 years of data, train my program on it, and develop a much better program.
That might take me a very long time — maybe two or three months to develop the whole thing — but it can be done. So I've defined my problem: I'll take three parameters — postal district, size, and bedrooms — and predict the monthly rent. URA lets you download this data as CSV. They used to have an API to get the data directly, but they closed it; they've opened a new API now, but for whom I don't know — I couldn't get access. So I downloaded the CSVs, and let me go through the exploration. You can all see my code behind me, right? The first task was loading the data. Here you see I created a testing CSV — just one of the CSV files I downloaded — and read it with pandas' read_csv into a DataFrame. From that data set, I selected only the columns important to me. Then I loaded the data into a SQLite database behind it. I'll show you the code exactly. First I need to do some exploration of the data: what kind of data is it, what values are there? Can everyone see this? Don't bother about the first part — there I'm creating the model using SQLAlchemy. I don't know how many of you have used SQLAlchemy; it's an ORM mapper I use to read and write the data, so I won't go into those details. The important thing is here: on the URA data set I'm doing read_csv. And here you'll notice one problem. I'm using the pandas library, and if you've worked with pandas, you know it automatically infers the data types from a CSV file.
With the URA data that causes problems. One of them is that your district code comes out as a number, but you want it as a string, because it's categorical data — District 15 is just the label "15", it has no numerical value one way or the other. So when reading, I pass dtype={'postal_district': str}; this is the parameter I found that casts and interprets that column as a string. Then I made a copy of the DataFrame taking only postal district, number of bedrooms, floor area, and lease commencement date. Next, I looked at the data. My exploration goes like this: after reading, I look at the top rows, check whether the data looks OK or not, and then work with it. Let me run this. So obviously I check the data — what's inside, what values are there, what's in the database — and here is the data summary: pandas' describe() picks out the numerical columns and summarizes them. I think this one was for postal districts 12 and 13 — the East Coast area, which is where I did it last time. It gives a mean rental of 7,199 for all the data I loaded; the 25th percentile is 4,000, the median around 6,300, the 75th percentile 9,000, and the maximum rental was 54,000 in the URA data I took. All right? Now, one problem I faced while loading the data: some of the square-footage fields had no values, and I need a numerical value because I want to use square footage as a number.
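The read-and-inspect step can be sketched roughly like this. The column names and sample rows here are my own stand-ins, not the real URA schema — the point is only the dtype override and the quick describe():

```python
import io

import pandas as pd

# A few rows shaped like the URA rental-contract CSV (column names and
# values are illustrative guesses, not the real download).
csv_text = """building_name,postal_district,no_of_bedrooms,monthly_rent,floor_area,lease_commencement_date
Sea View,15,3,7200,1200-1300,2008
Cote D'Azur,15,2,4000,900-1000,2004
The Makena,15,3,9000,1500-1600,1998
"""

# Without dtype, pandas would infer postal_district as an integer; forcing
# str keeps it categorical (district 15 is a label, not the number 15).
df = pd.read_csv(io.StringIO(csv_text), dtype={"postal_district": str})

print(df.head(2))                      # eyeball the top rows, as in the talk
print(df["monthly_rent"].describe())   # count / mean / quartiles / max
```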
In the data it's given as a range, so I did a minor transformation. First: dropna() on the loaded rentals. What dropna() does is drop every row containing missing values (NaN). Some rows come in empty, and it automatically removes them, because for training I need cleanly labelled data — I'm still preparing my model. Second, I interpret the area: I take the range, split it into its two ends, and take the mean — the middle value. If it's 1200 to 1300, it becomes 1250, the mean of the two. That's what I'm doing here: splitting the square-foot range and taking the mean. Even after doing this, some square-foot fields had no usable value, so I then applied the condition area > 0. And then — this part is just the model — I load this information into the database behind it. With all of this, I've prepared the data for my initial machine learning exploration. I can tell you: if you're solving a machine learning problem, 60 to 70 percent of your time will go into this — exploring and preparing the data. Now I want to evaluate models, because there are many machine learning models available as libraries, in Python and in other languages. There are a lot of machine learning libraries you can work with: TensorFlow is one, PyTorch is another. I'm using scikit-learn here because, as I said, I'm a Pythonista. How many of you have worked with these? OK, you have.
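A minimal sketch of that cleaning pass — dropna, range-to-midpoint, positive-area filter — on toy data (the column names are my assumption):

```python
import pandas as pd

# Toy frame mimicking the downloaded data: floor area arrives as a text
# range, and some rows have missing values.
df = pd.DataFrame({
    "postal_district": ["15", "15", "15", "15"],
    "no_of_bedrooms": [3, 2, None, 3],
    "monthly_rent": [7200, 4000, 5000, 9000],
    "floor_area": ["1200-1300", "900-1000", "1100-1200", None],
})

# 1. Drop rows with any missing value -- for training I want cleanly
#    labelled data only.
df = df.dropna()

def area_mid(rng: str) -> float:
    """Turn a '1200-1300' range string into its midpoint, 1250.0."""
    lo, hi = rng.split("-")
    return (float(lo) + float(hi)) / 2

df["area"] = df["floor_area"].apply(area_mid)

# 2. Keep only rows where the interpreted area is positive.
df = df[df["area"] > 0]
print(df[["monthly_rent", "area"]])
```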
So here I'm not doing anything special. First I connect to the database I loaded. Then I run a SQL query to read the data into a pandas DataFrame, and I look at the first two rows — you can see them here, the upper two rows of the data. Then, as I said before, linear regression needs numerical data, not categorical data — but our district is categorical. In pandas there is a function called get_dummies which converts your categorical data into numerical data. What it does: suppose there are five district values, 01 to 05 — it will make five columns, put the value 1 in the column for the row's district, and 0 in the others. So I call get_dummies and print the top two rows again, and you can see the five district columns below. Any questions so far? get_dummies handles the categorical columns by itself, converting them into the numerical representation most machine learning algorithms require. Now, machine learning algorithms work on NumPy arrays; they do not work directly on pandas DataFrames — I use DataFrames for data manipulation. So in the next step I take dataset.values, which returns the NumPy array of those values. Then I divided my data into two parts, X and y: one is the feature data, the other is the prediction target. The rental is y, these are the parameters, and my program will learn the relationship from them automatically.
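The encoding-and-split step can be sketched like this; the column names and district values are illustrative, not the real schema:

```python
import pandas as pd

# Toy data with a categorical district column.
df = pd.DataFrame({
    "monthly_rent": [7200, 4000, 5200],
    "area": [1250.0, 950.0, 1050.0],
    "no_of_bedrooms": [3, 2, 2],
    "postal_district": ["01", "02", "01"],
})

# One-hot encode the district: one 0/1 column per district value.
encoded = pd.get_dummies(df, columns=["postal_district"], dtype=int)

# Estimators want NumPy arrays, not DataFrames.
dataset = encoded.values

# Column 0 is the rental (the target y); the rest are the features X.
y = dataset[:, 0]
X = dataset[:, 1:]
print(encoded.columns.tolist())
print(X.shape, y.shape)
```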
Now, you can see there are two plus five — seven pieces of information in total: five district columns, size, and number of bedrooms. So I slice the array: column 0 is the actual rental, and the remaining columns are the features. Then the question comes up: there are a lot of regression algorithms — which one to use? Which gives better results for my data? Before that, you would normally do one more thing, which I haven't shown here: scatter plots. With a scatter plot you try to see the correlation between a feature and the target — for example, how much does size influence the actual rental? You can also spot outliers and take them out. I didn't show that part here because it gets more advanced, but once you go into machine learning you will do a lot of it; that's where you clean up the data further. Then, in the next phase, I define the regression models, importing them directly. There is the simple linear model, the random forest model, and I also had ElasticNet — so one is ElasticNet, another is random forest, both from scikit-learn. I did not use support vector machines, because they take a lot of parameter tuning and a lot of time to train, which isn't good enough for a short meetup or one-hour presentation — it wouldn't finish by then. So first I define the models, then I run them through cross-validation to see how they perform over the whole range of my data. How many of you have done validation in machine learning? In cross-validation, suppose I have 100 pieces of training data: I divide them into ten parts, use nine parts for training and one part for predicting.
Then I calculate the mean and standard deviation of the error, and repeat for each split — one held-out set, then another, then another. I don't have to do this manually — I could write a program for it, but as I said, Python's ecosystem is built for these purposes. scikit-learn provides cross_val_score, so I just use that: I pass in the models, and from the validation scores I work out the root mean squared error of each algorithm. So let's check it out. First I comment out the prediction part, then run it. You can see it ran — I removed the prediction part above — and it gives me: the average error of ElasticNet is 1,163, and 895 is the average error of the random forest. Based on this, which one would you say is better? Obviously random forest, because it has the lower mean error. So now I've decided: I want to use the random forest model. I've tested it and it works reasonably — it gives an approximate error of 895 Singapore dollars on average over the whole data set. Now my next step: I have the model, I need to train it and use it for prediction. So I define model 1 and model 2 — one is the random forest, the other ElasticNet; I'm doing both. The fit function, provided by scikit-learn, Python's ML library, trains the model on my X and y data. Then I print the results, and you can see the prediction from the program: the original rental was 4,850.
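Here is roughly how that comparison looks with scikit-learn's cross_val_score — on synthetic data, since I don't have the real arrays here, so the printed RMSE values won't match the talk's 1,163 / 895:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared URA arrays: rent driven mostly by
# area and bedrooms, plus noise.
rng = np.random.RandomState(0)
area = rng.uniform(500, 2000, 200)
bedrooms = rng.randint(1, 5, 200)
y = 2.5 * area + 400 * bedrooms + rng.normal(0, 300, 200)
X = np.column_stack([area, bedrooms])

models = {
    "elastic_net": ElasticNet(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

rmse = {}
for name, model in models.items():
    # 10-fold cross-validation: train on 9 parts, score on the held-out part.
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    rmse[name] = float(np.sqrt(-scores.mean()))
    print(name, round(rmse[name], 1))
```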
The predicted price is 4,898 — around a $50 difference — and the ElasticNet predicted 4,842 against 4,850. I tested it on 10 values taken from my own data set. Now, I've trained the model; the next step is to save it. If you look at the last part, that's what I use to save the model, so that for future predictions I don't need to retrain — I can just load it and predict. I use joblib, a library available in Python that helps with exactly this. I save it, run it, and there it is — this is what it looks like. Now I want to use this saved model in a web service: it receives three pieces of data and predicts from them. Since we defined our problem as needing three parameters from the user — the size, the number of bedrooms, and the district code — my program predicts the rent from those. So I developed a web service that accepts these three parameters, loads the saved model, and predicts. Here is the code — don't be scared by it, it's very simple. It's app.route on my web service: somebody calls the POST method and sends me data. I'm using the Python marshmallow library, which makes it very easy to take the input and validate it — anyway, I won't bother you with that. I get the values — district, size, and bedrooms — and validate them; then here I'm loading the model, and here I'm just predicting.
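The train-save-load cycle with joblib can be sketched like this (synthetic data again; in the real setup, only the load-and-predict half runs inside the Flask request handler):

```python
import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data standing in for the prepared URA arrays.
rng = np.random.RandomState(0)
area = rng.uniform(500, 2000, 200)
bedrooms = rng.randint(1, 5, 200)
X = np.column_stack([area, bedrooms])
y = 2.5 * area + 400 * bedrooms

# .fit trains the model on X and y ...
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# ... and the trained model is saved so the web service never retrains.
dump(model, "rental_model.joblib")

# Later (e.g. inside the web service) the model is loaded back and used
# to predict for a fresh request: 850 sq ft, 3 bedrooms.
loaded = load("rental_model.joblib")
pred = loaded.predict([[850.0, 3]])
print(round(float(pred[0]), 2))
```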
But one thing you must have noticed: the input coming from the user is only three pieces of information, and I convert it into a NumPy array — that's all this code is about. And that's what you saw just now, the service running. So this is exactly the same service running here, and when I send it data you can see it — I say district 02... oh, sorry, not the deep one; I need to call the normal one, because the code I've shown you is for the normal one. All right? So that was machine learning: I went from defining the problem, to exploring the data, to developing a machine learning model, to predicting based on that model. Obviously it's not production-ready — that would still take three or four months of work, getting 15 or 20 years of data and working with it. Now, the same linear regression problem can be solved with neural networks, and with neural networks we work a bit differently. There is a lot of theory you can read on neural networks, and a lot of detail on the web about the available libraries. For me the problem was very simple: I wanted to define a linear regression and use SGD or the Adam optimizer — there are many mathematical optimizers you can use to optimize the predictions, and the longer you train, the better the results. I couldn't train for very long — I only trained it for 20 minutes. Now, for using a neural network here: you've seen I've solved one problem already. The data is defined, cleaned up, and in a NumPy array, and I want to use the same data for the neural network. Since the data problem is solved, I just need to apply it to a neural network model. So first I need to create the neural network.
I use the Keras library with the Sequential model, in which I can define a neural network — a single-layer network with a basic topology. If you look here, I'm defining a baseline-model function with input dimension 7: the parameters coming in from the NumPy array are seven — bedrooms, size, and the five district columns. And since I'm not using a deeper or wider topology, the layer width matches the input, 7. What happens in a neural network is that each layer does a calculation, passes it on to the next, which calculates again, and the optimizer adjusts the function — you can read more details in the theory. For me the important parts were these: instead of plain SGD, stochastic gradient descent, I use the Adam optimizer here, and for the activation function I use ReLU. If you want the theory of why ReLU or why Adam, you have to read through the machine learning material. This function basically just takes the data, performs the calculation, and optimizes the model. So here I instantiate the model, here I print the summary of the model so you can see what it looks like, and then I print the shape of my X and y data — the shape tells me how many values there are. And then I run fit. In fit I set epochs and batch size: the batch size is how many samples it takes at a time, and epochs is how many iterations over the data. So it takes 50 values at a time and iterates 150 times to optimize, batch after batch. It takes fairly long, 7 to 8 minutes. I can run it for you right now. Here is the output of the model summary, here is the shape, and these are the predicted prices based on this fit. So I run it — it's running now; it will take a bit of time.
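A sketch of that baseline network in Keras — seven inputs, one hidden layer of seven ReLU units, a single linear output, Adam optimizer and MSE loss. The training data here is synthetic so the block is self-contained; the real run uses the prepared seven-column URA array:

```python
import numpy as np
from tensorflow import keras

def baseline_model():
    # One hidden layer, 7 units, matching the 7 input features
    # (5 one-hot districts + size + bedrooms).
    model = keras.Sequential([
        keras.Input(shape=(7,)),
        keras.layers.Dense(7, activation="relu"),
        keras.layers.Dense(1),          # single output: the predicted rent
    ])
    # Adam optimizer instead of plain SGD, mean squared error loss.
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

# Tiny synthetic stand-in for the prepared 7-column feature array.
rng = np.random.RandomState(0)
X = rng.rand(100, 7)
y = X @ np.array([1.0, 2.0, 0.5, 0.5, 0.5, 3.0, 1.5])

model = baseline_model()
model.summary()                          # print the topology
print(X.shape, y.shape)                  # shape of the training data
model.fit(X, y, epochs=150, batch_size=50, verbose=0)
preds = model.predict(X[:2], verbose=0)
print(preds.shape)
```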
It will take a bit of time to run, and then you'll see the predicted prices and the details. My explorations so far give me this answer: for a traditional problem like this one, classical machine learning algorithms work better than neural networks — and I've done the same validation for this model too, though I haven't shown you the mean squared error here. For this model I can also define three different topologies: the normal 7-unit one; a deeper topology — say a first layer of 7, then layers of 6 and 4, giving a deeper layered network; or a wider network — and then evaluate which performs better. But there was one problem: with the categorical data and my very limited training time, the mean squared error it produced went into the trillions. So I prefer something that can fit in my brain over something that cannot, and for this kind of typical problem I prefer classical machine learning. It's still running, so you don't see the results yet — maybe five minutes more. In the meantime: I've now trained my network. It takes fairly long, and normally you'd use GPUs to train it. Once you've trained it, you follow the same process again: save the model. How did I do it? Very simply — Keras uses the H5 (HDF5) format, which saves the model and the trained weights together in one file. The optimized parameters — the coefficients — are stored along with the model, ready for prediction on unseen data. That's what's saved in this H5 file. It's still running, but this is how it looks — you can see here. Now once I've saved the model, the rest is very simple.
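Saving and restoring the trained network in HDF5 can be sketched like this (a small stand-in model; the `.h5` filename is my own choice):

```python
import numpy as np
from tensorflow import keras

# A small stand-in model (the real one is the 7-input baseline network).
model = keras.Sequential([
    keras.Input(shape=(7,)),
    keras.layers.Dense(7, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")

X = np.random.RandomState(0).rand(10, 7)
model.fit(X, X.sum(axis=1), epochs=2, batch_size=5, verbose=0)

# Keras saves architecture + trained weights together in one HDF5 file,
# so the web service can load it later without retraining.
model.save("rental_nn.h5")
restored = keras.models.load_model("rental_nn.h5")

# The restored model predicts identically to the original.
a = model.predict(X[:1], verbose=0)
b = restored.predict(X[:1], verbose=0)
print(np.allclose(a, b))
```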
I load the model in my web service and call .predict — .fit is for training, .predict is for predicting. Ah, it finished — you can see it. How did I use it? Here is the code: again the same thing — I get the three parameters and interpret them, then convert the input into a NumPy array and pass it in. The prediction comes back as a float32 NumPy type, so I cast it to a string; otherwise it won't serialize back to JSON — jsonify wouldn't work. Now, one more thing I wanted to show you besides the topologies: StandardScaler. How many of you know about StandardScaler? As I said, at the end of the day machine learning is about optimizing a function to give you a value for unseen input. Suppose you have two features: one has values around 4,000, the other around 10. The feature with the larger magnitude dominates the function, so your predictions tend to be pulled towards the coefficients of the high-valued feature. So in machine learning we bring all the features onto a similar scale — StandardScaler, for example, standardizes each feature to zero mean and unit variance. So let's apply it and look at the mean squared error and root mean squared error. Here our errors were 1,163 and 897. I apply StandardScaler and recalculate — similar, it didn't change much: 1,141 and 899. Now there is another scaler as well: RobustScaler.
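The scaler comparison can be sketched with scikit-learn pipelines — again on synthetic data, so the printed RMSEs won't match the talk's numbers; the point is only how StandardScaler and RobustScaler slot in before the estimator:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Two features on very different scales: area in the thousands,
# bedrooms in single digits -- the situation scaling is meant to fix.
rng = np.random.RandomState(0)
area = rng.uniform(500, 2000, 200)
bedrooms = rng.randint(1, 5, 200)
X = np.column_stack([area, bedrooms])
y = 2.5 * area + 400 * bedrooms + rng.normal(0, 300, 200)

rmse = {}
for scaler in (None, StandardScaler(), RobustScaler()):
    # With a scaler, wrap it and the estimator in one pipeline so the
    # scaling is re-fit inside every cross-validation fold.
    est = ElasticNet() if scaler is None else make_pipeline(scaler, ElasticNet())
    scores = cross_val_score(est, X, y, cv=10, scoring="neg_mean_squared_error")
    rmse[type(scaler).__name__] = float(np.sqrt(-scores.mean()))

for name, err in rmse.items():
    print(name, round(err, 1))
```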
And you can see that when I use the RobustScaler, my error increased for ElasticNet but decreased a little for the random forest. This is how you really need to keep exploring, to fine-tune your algorithm. Now we have solved the problem, right? With the bedrooms and size. You want to make it better. What would be your choices for improving it, for making your prediction more robust? Ideas? Any? [Audience: Add more variables.] Add more variables. Like? Such as location; if you don't have the location, add the location. Okay, there is another problem, which I told you about earlier: if I want to predict the price, I need to know the factors which affect the prices, right? Now, the problem is I want to be closer to the real number, correct? In Singapore there is a very special situation: real estate is heavily affected by government policies, like the ABSD, correct? Another possibility is the number of foreign arrivals into Singapore, because, if you know, the rental market is mostly driven by expats and non-Singaporeans. Singaporeans have 89.9% home ownership, so I don't think Singaporeans impact the rental market very heavily. But if you take the number of incoming foreigners from the Ministry of Manpower, that might be an indicator. You have to really explore and see whether it has an effect or not. A second factor you can look at is how much money is going into building and construction, data you can get from MAS, the Monetary Authority of Singapore. They release monthly data on debt extended to the building and construction industry. You can get something from that, but there is a problem: it might introduce more noise into your prediction, because maybe that lending is for infrastructure projects like building an MRT line, and they don't separate between those.
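The scaler comparison just described, StandardScaler versus RobustScaler across ElasticNet and a random forest, scored by mean squared error, can be sketched as below. Synthetic data stands in for the rental dataset, and the hyperparameters are defaults, so the numbers are illustrative only.

```python
# Compare two scalers across two models by mean squared error,
# as in the talk; make_regression substitutes for the rental data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

X, y = make_regression(n_samples=500, n_features=3, noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for scaler in (StandardScaler(), RobustScaler()):
    for model in (ElasticNet(), RandomForestRegressor(random_state=0)):
        pipe = make_pipeline(scaler, model).fit(X_tr, y_tr)
        mse = mean_squared_error(y_te, pipe.predict(X_te))
        results[(type(scaler).__name__, type(model).__name__)] = mse
        print(type(scaler).__name__, type(model).__name__, round(mse, 1))
```

Putting the scaler inside the pipeline matters: it is fit only on the training split, so no information from the test set leaks into the error estimate.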
Or I can take big property companies like CapitaLand, look at their reserves, and incorporate that into my algorithm. I can do a lot of those things, correct, to test my hypothesis, to see whether it works or not. But there is one problem we haven't solved: how to predict future prices. So what would be your strategy for predicting future prices? Because we haven't tackled that problem yet, correct? This only predicts what my price would be today, and I want to predict future prices. How would you go about it? Anybody? [Audience: You should look at the trends.] Look at the trends, right? That means time series forecasting, correct? But for that you need a very precise time series; the prerequisite is a clean and clear time series. With property data, will you get that? So maybe the trend can be shown, correct? But what factors affect the trend? We can incorporate them, and maybe we can predict a time series. For example, in my own exploration I was once thinking: can I use GDP to predict the future? But my problem is I cannot, because GDP itself is a prediction; it might be true, it might not be. If I take the number of foreign arrivals to predict the future, that is also a prediction in itself. So my value only comes out true if that underlying prediction is true, right? I stumbled upon a lot of these problems, and then I realized that ultimately, if you want to define a machine learning problem, you need a clear-cut understanding of the problem and what you are trying to achieve with it. And that is my talk about machine learning. Any questions? [Audience question about the date field.] It's longer than that. Correct, that's the problem. I was thinking the month could have an influence, because they only give the data by month, not by date. Not by date; no, they don't have a datetime. They only have the month. As I said with the data, right? They only have the month and the year.
And month is a very broad category to predict with, and month also becomes categorical data again. [Audience: So how will you handle time?] Okay, how I will handle time is very simple. I will convert "Feb-17", "Feb-18", and so on — the months — into integers from 0 to 11. I will transform the data that way, correct? Then, when somebody asks for a price, I will implicitly use datetime to get today's month and supply it as a parameter to my machine learning model for the prediction. I thought of giving a demo of this, but I did not have enough time to put it in the code. My purpose for this talk was mainly to give you an idea of how you can explore machine learning and turn it into a web service. Once you have developed the web service, the machine learning part is done, and you can actually have your online system running. The problem for me was that when I was exploring all the literature online, nobody was talking end to end; everybody was talking about one part at a time. The majority of the machine learning literature is about how to choose the model, how to define validation errors, and what kind of optimization functions to use. Nobody focuses on the other parts, which are just as important: the application of it, how to use it, how to deploy it. That is what I learned, and I thought, okay, let me share it. I think that is it from my side. Yes, please. [Audience: What drove you to use this rather than a rule-based approach to classify the prices?] You're absolutely right. [Audience: A rule-based approach could pretty much classify to, let's say, an exact value.] No, but if you go with rule-based pricing, I'll give you an example of rule-based pricing. There are 15 districts in Singapore, correct? Then there are different square footages in Singapore, right? Then there are different numbers of bedrooms, right?
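The month handling just described can be sketched as follows. The label format ("Feb-17") and function names are assumptions for illustration; the talk only specifies that the data carries month and year, and that the current month is derived implicitly at prediction time.

```python
# Sketch: encode month names as 0-11 integers for training, and derive
# today's month with datetime when a prediction is requested.
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def encode_month(label):
    """Map a label like 'Feb-17' to a 0-11 month index (format assumed)."""
    return MONTHS.index(label.split("-")[0])

def current_month_feature():
    """At request time, supply today's month as the model input."""
    return date.today().month - 1   # datetime months run 1-12

print(encode_month("Feb-17"))   # → 1
```

Note that this still treats month as an ordinal number; since December (11) and January (0) are actually adjacent, a cyclical encoding is another option worth exploring.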
If you go with a mean-and-standard-deviation type of approach, you have to write a lot of code, actually, just to define a rule-based system that predicts for a particular square footage. [Audience: Couldn't it be done using one of the standard statistical tools?] Yeah, you can try mean and standard deviation analysis and, based on that, predict some prices, the way we used to do statistical analysis in the old days, right? [Audience: That's not an exact price, though. It's an interval.] Yeah, that's right, it's a confidence interval. So you could use Monte Carlo simulations or other statistical methods to get to an answer. The only thing is that this approach is quite easy — I believe it's quite easy — and the results have been very good. Actually, if I had 20 years of data, the results would be very close. It's already within a 10% error margin without doing much. So I was thinking that if I spend five, six, or eight months on it — solving the problem, getting data, transforming it — I can come very close to the actual value. There is one very big problem you face in machine learning and artificial intelligence: the majority of the field focuses on convolutional and recurrent neural networks for classifying images and for speech processing, for extracting meaning from speech. In all of those problems, one thing is very clear: your training data and the labels are there in the image itself. You understand? When you're training something, right? The MNIST data is there; you work with MNIST data to train an image recognition algorithm, correct? For image recognition, you only use the image as input. The image is everything you require.
But when you're trying to solve a practical problem — I want to predict a stock price, I want to predict a house price — the data you get is, in itself, a prediction. The data on which you base your prediction is also based on certain predictions. And there you really need to work out what data to use and what not to use. If I'm doing voice recognition, my input is just the voice. That's the reason the common problems that traditional machine learning architectures solve don't require exploring much outside data. When I was searching for this house-price regression problem, I found five or six examples, and that's about it. Most of them were working on the Boston dataset, or somebody had customized the Boston dataset for Washington or for some other country's data. I was looking at that and thought, okay, this doesn't work for me; I had to really dig into it and work out how to do it myself. I'll give you a very common problem: as I said, I'm working on a product catalog manager, right? The problem comes up if you are improving supply chain management, SCM. Every customs authority uses harmonized HS codes for import and export documentation — if anybody works in import-export documentation, you know HS codes. Then for spend data and government procurement, they use UNSPSC, which has 87,000 categories, while in the Harmonized System each country has its own categorization. Right? Now, if you have a product, you want to identify which category it goes into, based on the product description and prices, so that you can help with documentation: when a product comes in, I know which harmonized code it belongs to and what its duty will be, so the system can calculate it automatically. Right?
There, I am really facing a lot of problems, because — I don't know how many of you have used Amazon's categorization, but if you look at it, even country by country it is so disorganized. One product can be in three categories, and they give the vendor flexibility, so the vendor puts it in two more. Some products have a good description, some don't. So I'm facing a lot of those problems in my own field, where I want to reorganize product catalogs, improve the supply chain, and help companies attack multiple channels easily. I'm really solving the very fundamental problem of organizing information; I haven't yet reached the stage where I can use this data to make predictions. Right now, I'm trying to create product information I can use in the future, so that if you point your mobile phone at a product, you can get its price, its details, where it was produced, and how it works. Because at the end of the day, all of that is driven by where the product information is. That is it from my side on machine learning. Any other questions? Yes, please. [Audience: Did you try a CNN? Does it work?] I did not try that; I did not have enough time. [Audience: Do you think it would help?] Maybe not for this problem. It's a typical linear regression problem; a CNN or RNN might not help here. That's based on the literature — I'm only going by the literature. [Audience: How would you use an RNN? How would the prediction work?] I haven't used it yet, so, to be honest, I do not know. But if you do pure time series prediction, you need very good time series data. If you have log data from your servers, or access data captured through a machine or a sensor, then you can use it for time series prediction.
But the data that humans input — analyzing that data and using it in an RNN is a bit hard. That is my experience; I don't know, everybody will have their own experiences dealing with data. My purpose for this talk was only to explain, in a simple way, how machine learning works, how to use it, and how to work with it. Obviously, taking even a simple problem to production will take you six to eight months of work, because you need to gather a lot of information and work through a lot of errors. [Audience: Is the source fully open?] I can make it open; it was just developed. Because, you know, at the Python meetup, I did not cover the web service part; I did not cover the actual usage, where the prediction is shown through a web service. So I just wrote that today, in the last hour. [Host: Everyone, our time here is almost running out.] I'm done. [Audience: If we have further questions, will you be around for a little longer?] Yeah, I'm here for a little bit, but this venue will be shutting down by 9 o'clock. So if you have any further questions, you can move down to the bar downstairs and ask them there. Thank you, everyone.