My name is Thuong. I'm working as a research scientist at Trusting Social. Before getting into my talk, I just want to give a quick introduction to Trusting Social and what we are working on. Trusting Social is a company that delivers data science technology for financial inclusion. We do credit scoring using alternative sources of data, so that we can assess creditworthiness for the whole population. We have offices around Asia: the headquarters in Singapore, an office in Ho Chi Minh City, a few in India, and the research team is in Ho Chi Minh City and Melbourne. I am based in Melbourne.

Here is the outline of my talk today. I will divide the talk into two parts. The first part is the motivation of the company. We will start with the credit scoring problem, credit scoring using alternative sources of data, and the economic scale of our solution. In the second part, I will focus more on the technical side, with a brief introduction to prediction modeling, the challenges we have faced so far in applying machine learning and data science to this problem, how we worked through them, and what we are working on at the moment.

First, I prepared the talk for a general audience, so I just want to give a quick introduction to credit scoring. Some of you may already know this topic, but basically, in finance, lending is the main business of these institutions. Before a bank or a financial company can provide a loan, it needs a way to estimate the level of risk associated with the consumer. So, by definition, credit scoring is the process of evaluating the risk of a consumer, the probability of defaulting on a financial obligation. To give a bit of context: at the moment, FICO is the biggest credit score provider in finance, and if you work in banking and finance, you must have heard of FICO. As a statistic, about 10 billion FICO scores are sold every year, which averages out to about 27 million credit scores sold by FICO per day. And in the context of the US, 90% of lending decisions are made using FICO scores.

So, how does a credit scoring model work? At prediction time, the observation time, which is normally the time that people apply for the loan, we want to predict whether they will default or not in the future. What we need is the history of the person; in the FICO solution, that means credit attributes about the person, which I will show in more detail later. Using those attributes, they predict whether that particular consumer is going to default or not. So, what do they need? Basically, the financial behavior of the consumer. Some examples are the payment history, the amounts owed, or the length of credit history. Some other things are new credit, the types of credit the consumer uses, and other information that can be obtained from public records.

So, given that information, can you tell me a problem with FICO? What is the problem that FICO may face, for example, in the context of India? Lack of data. Lack of data, but in particular, what kind of data do they lack? Any other answer? In the US, most people have a banking history. Most of them have a bank account and a long history; of course, some of them may never have taken a loan. But in the context of India, for example, most of the population have no record at a bank and have never taken a loan.
So, how can they actually assess the creditworthiness of consumers who don't have any credit history? That is the reason why we are here. This is the issue with FICO credit scoring: it cannot assess the creditworthiness of consumers who don't have any banking or credit history. And worldwide, there are about 1.5 billion adults without any credit history. So, basically, the FICO score keeps those consumers out of the financial loop. If they do not have a credit history, they have no creditworthiness in the eyes of the lender, and the bank is not going to give them a loan, or it gives them a loan at a very, very high interest rate. Basically, they are excluded from access to finance.

So, what is the solution? If we face that problem, can you think of a solution? What can we do? Does anyone have an answer? Yes, we have to use alternative sources of data. Because if you insist that you need credit history to assess creditworthiness, then you can't solve the problem in developing countries like India, Vietnam, or Indonesia. That's why we have to use alternative sources of data. But what sources can we use? Can you give an example? Some companies use social media data, but that direction is basically very hard now because, for example, Facebook closed its APIs in, I think, 2014. So we basically can't crawl data from Facebook anymore, especially after the Cambridge Analytica scandal. At Trusting Social, we partner with telcos to use telco data, combined with other sources of data as well, to assess creditworthiness. The main principle here is this: traditionally, the financial behavior of consumers is used to assess their creditworthiness; we are going the other way around and use their overall behavior to assess their creditworthiness. The reason is that behavioral data is much richer than financial behavior. With financial behavior, you only have the credit history, how they pay their loans, and things like that. But behavioral data captures much more about the person. We can continue this topic offline.

So, what is the effect of using these alternative sources of data? We assess creditworthiness based on behavioral data, and we have shown that our scores outperform scores based only on financial behavior. We can cover the whole population, and in particular we cover the unbanked population. That's why the slogan of the company is that we provide financial inclusion for all. And we can also scale the business much further than other solutions. To give an example: a few slides back, I showed that the FICO score can only assess the creditworthiness of consumers who already have a credit history. Because we can give a credit score for the whole population, we can provide better, fine-grained, customized products to consumers. For example, if we can assess creditworthiness with good accuracy, we can confidently provide a loan to a consumer at a lower interest rate.

In the second part of my talk, I would like to focus a bit more on the technical details. As a research scientist in machine learning, my talk is going to favor the machine learning perspective and the challenges that we have faced so far on this journey.
So, because the talk was prepared for a general audience, I just want to give a very quick overview of prediction modeling. In prediction modeling, what we want to do is predict an outcome for some example. In our problem, the goal is to predict whether a consumer will default or not, and with what level of confidence. So, what do we need? This is the whole process of training a machine learning prediction model. We need a set of training examples, and of course we have to extract features from that set. We feed those into a supervised model such as logistic regression or a random forest. Those models have parameters, and at the start we need to initialize the values of those parameters. The model then produces predictions for the examples in the training set. Based on the model's predictions and the labels, the ground truth that we already have, the learner updates the parameters to produce better predictions. We repeat this process until we are satisfied with the performance of the model. After we train the model, we test it on a test set that the model has not seen during the training phase. We use the parameters that we have trained to make predictions, and then we compare those predictions with the labels to evaluate the performance of the model.

So, going through this process, what do we need to make this prediction model work? First is the data, of course, the data with the labels, right? Then from the data we need to extract the features, we need the labels, and we need the models. And of course, we need computational facilities to run this process. But each of these elements comes with its own challenges, and I will go through a few challenges that we have faced so far. The first is how to obtain labeled data. The second, especially in the context of credit scoring, is the class imbalance problem. The third challenge is complex and noisy data. The fourth is the curse of dimensionality. The fifth is the huge volume of data. The sixth is the reliability of the model. And the last one is concept drift. I will go quickly through each of these challenges.

The first challenge is how to obtain labeled data. This is the most important part, because it is necessary for building any machine learning model, and in particular any prediction model. The golden rule here is garbage in, garbage out. So, how can we obtain that data? We have to build partnerships with different stakeholders, including banks, telcos, and financial institutions. It is not an easy task, and we had to spend a few years to actually establish these partnerships.

The second challenge is imbalanced data. To give you an example, if we have a set of 10,000 loans, maybe only around 500 to 600 of those loans default. So the rate of default is very, very small compared to the total number of labels that we have. There is no model that can deal with this imbalance problem easily, and there is a fundamental limit on the accuracy we can achieve with these kinds of labels; we have to accept that. So, the lesson learned from this challenge is that we have to be mindful not only of the data, the features, and the labels, but also of the imbalance in the data. If we do not use a suitable performance metric, our evaluation is going to mislead us; a minimal sketch of how this can be handled is shown below.
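To make that concrete, here is a minimal sketch, in Python with scikit-learn, of the training and evaluation loop described above, with the class imbalance handled through class weighting and AUC used as the performance metric. The feature matrix `X` and default labels `y` are synthetic stand-ins, not our real data, and the exact settings are illustrative assumptions rather than our production setup.

```python
# A minimal sketch of the train/evaluate loop described above (illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))             # hypothetical consumer features
y = (rng.random(10_000) < 0.05).astype(int)   # ~5% default rate: heavily imbalanced

# Hold out a test set that is never touched during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" re-weights the rare default class instead of
# letting the model optimise mostly for the majority (non-default) class.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# With imbalanced labels, plain accuracy is misleading; AUC on the held-out
# set is a more informative performance metric.
scores = model.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, scores))
```

The point is not the specific model: it is that the rare default class is re-weighted during training, and that the held-out score is a ranking metric (AUC) rather than plain accuracy, which would look deceptively high on imbalanced labels.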
The second lesson is that we should not use machine learning models as a black box; we have to customize them to suit our needs.

The third challenge that we faced was complex and noisy data. The data that we have is quite messy, in the sense that there are a lot of missing values and a lot of duplicated and correlated columns and data sources. To give an example, we have data from the telcos: call and SMS transactions, top-up transactions, and value-added services. Most of these data sources are noisy and unstructured. So, I want to emphasize the golden rule again: garbage in, garbage out. Even if you have a very, very good machine learning model, for example XGBoost, a very powerful model, if you do not process your data well and you don't have a clean feature set and a clean label set, your model is not going to work. So, how did we deal with it? First, we rely on the data engineering team to do data cleansing, which is a very heavy job for them. Then we need the data team to do the feature engineering; they need to understand the characteristics of the data to come up with a set of features that can be extracted from it.

But this process also comes with its own problem. We can generate up to maybe 10,000 features. With such a large number of features, what problems come up? That is the curse of dimensionality. We have huge dimensionality and a high level of sparsity: some features may only cover 10 or 15 percent of the population. So the level of sparsity is very, very high, but we cannot simply drop those features because we are not sure whether they contribute to the models or not. Also, because we extract them using, for example, statistical functions, many of the features are strongly correlated with each other. With that kind of data, it is very, very difficult to train machine learning models: they easily overfit our dataset, they produce unstable models, and they are difficult to scale. For example, if you want to predict the risk score of 200 million consumers, how long is your machine learning model going to run just for the prediction phase if you use 10,000 features?

So, what should we do about these problems? First, in terms of infrastructure, we need the best system that we can have; from a big data engineering perspective, we should have the best solution available. But that alone cannot solve the problem, so we have to come up with machine learning solutions. We have to do feature selection, and there are a lot of techniques we can use for it, or we can do semi-automatic or even fully automatic feature engineering (a small sketch of this kind of feature pruning follows below).

The fifth challenge that we face is scalability. To give you some context about the kind of data we deal with: one of the telcos that we cooperate with has 50 million subscribers, and they produce a few terabytes of data and millions of records every month. Moving to bigger markets like Indonesia or India, a telco can have up to maybe 250 million subscribers, so you can imagine the size of the data we have to deal with. To be honest with you, I don't work in this area myself, so if you are interested in this topic, you can come to our booth and talk to our big data engineers.
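Coming back to the feature selection point above, here is a minimal sketch of the kind of pruning a high-dimensional, sparse, correlated feature set might go through: drop nearly empty columns, drop one of each highly correlated pair, then keep the most informative features. The `features` DataFrame, the `labels` series, and all thresholds here are illustrative assumptions, not a description of our actual pipeline.

```python
# A minimal feature-pruning sketch for a sparse, highly correlated feature set.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def prune_features(features: pd.DataFrame, labels: pd.Series, k: int = 500) -> pd.DataFrame:
    # 1. Drop very sparse columns, e.g. observed (non-missing, non-zero)
    #    for fewer than 10% of consumers, then fill remaining gaps.
    coverage = (features.notna() & (features != 0)).mean()
    features = features.loc[:, coverage >= 0.10].fillna(0)

    # 2. Drop one column from each strongly correlated pair (|corr| > 0.95).
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
    features = features.drop(columns=redundant)

    # 3. Keep the k features most informative about the default label.
    selector = SelectKBest(mutual_info_classif, k=min(k, features.shape[1]))
    selector.fit(features, labels)
    return features.loc[:, selector.get_support()]
```

In practice the coverage cut-off, the correlation threshold, and the scoring function would all be tuned per dataset; the sketch only illustrates the overall shape of this step.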
The sixth challenge that we face is reliability. The typical approach to evaluating machine learning models is k-fold cross-validation: from the training set, we split the data into training and validation sets, and we do that for, say, 10 different subsets of the data. Or we can even hold out a test set that we don't touch at all during the whole training process, including hyperparameter tuning. But is that good enough in our situation? The thing is, we need to convince not only our team internally, but also our partners, the banks and the financial institutions, that our credit score is reliable. So we have to run an intensive evaluation process together with our partners to evaluate our performance.

The seventh challenge that we faced was concept drift, which happens for many reasons. The world is dynamic. Behavior, especially in telco data, the way people make calls or send messages, differs depending on many factors such as culture, background, country, or demographic group. We felt this very seriously when we moved to India. And behavioral patterns also change over time, because of factors like seasonality, social or commercial events, for example a festival coming up, or even changes in a person's own situation. So how can we deal with this concept drift? We have to analyze the bias and variance in the data and in the distribution of the data. Our experience is that if the concept drift is high, we use a low-variance model, and the other way around: if the concept drift is low, we can use a low-bias model.

To summarize my message here, there are a few lessons that we learned throughout this whole process. The first is that real data is always complex and noisy, and the golden rule, even though I come from a machine learning background, is always garbage in, garbage out. Even if your machine learning model is very, very good, if your data, your features, and your labels are not clean enough, you won't get good results. That's why data cleansing and feature engineering are so crucial in our journey. The third lesson is that we should not use machine learning models as a black box. I work as a research scientist, but sometimes I also take part in recruitment, and I can see that a lot of people now put in their CV that they are data scientists, with a lot of keywords in the resume saying they can do this and that. But after maybe half an hour of talking with them, we can figure out that they only, let's say, import a model from scikit-learn, fit the data, and train the model. If we ask any deeper question, they can't answer. And if they can't answer those questions, how can they fine-tune the models? How can they modify the models when they are not working well with the data? So the rule is that we need someone who really knows what is under the hood; we still use open-source libraries to save time, but if we need to, we can jump in and modify the models. That is a lesson we learned the hard way throughout our journey. Another lesson is that evaluation should not be limited to a simple train-test split; we need to keep testing dynamically with the new data that comes in (see the sketch below), and the reason is that real data is highly dynamic.
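As a small illustration of that evaluation idea, here is a minimal sketch that combines stratified 10-fold cross-validation on one time window with a check on a later, out-of-time sample; a large gap between the two numbers is one symptom of concept drift. The inputs `X`, `y` (the training window) and `X_later`, `y_later` (a newer period), and the choice of a random forest, are assumptions for the sketch, not a description of our production evaluation.

```python
# Cross-validated AUC on the training window plus an out-of-time check.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score

def evaluate(X, y, X_later, y_later):
    model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)

    # In-time reliability: stratified 10-fold cross-validation, scored by AUC.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    cv_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print("Cross-validated AUC: %.3f +/- %.3f" % (cv_auc.mean(), cv_auc.std()))

    # Out-of-time check: retrain on the full window, score a newer sample.
    # A large drop relative to the cross-validated score suggests drift.
    model.fit(X, y)
    later_auc = roc_auc_score(y_later, model.predict_proba(X_later)[:, 1])
    print("Out-of-time AUC:", later_auc)
```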
The philosophy that we implement at Trusting Social is that we always hire the best person for the job, someone who knows what they need to do. So, with all the challenges that we faced, why does Trusting Social work and how did we get through this journey? We have different teams, and I'm proud of the team members that we have; I can say that we have the best person for each particular job. For example, we have a business team who come up with the business problems and the partnerships that we have with our partners. The data analytics team have a very deep understanding of the data and come up with the set of features that work well with the models. The machine learning team can control and manage advanced machine learning models, and we are working on cutting-edge areas of the machine learning community. The big data engineering team is very good at data governance and scalability. And finally, the software engineering team delivers high-quality systems and products.

To give you a bit of context about what we are working on now, I will go through a few directions we are pursuing at the moment in terms of machine learning and deep learning. We work on graph analytics, because one of the important sources of data that we have is the social graph, the contact graph between customers. We also work on representation learning and a few other topics in deep learning, such as attention and transformer networks and deep generative models. We work on transfer learning, how we can transfer a model from one domain to another. And we also work on computer vision and NLP. In terms of the products and problems that we solve, we address a few problems in finance such as risk assessment, fraud detection, and face recognition and identification for know-your-customer purposes. We also work on chatbots and robo-advisors.

That concludes my talk today. If you are interested, you can talk to me offline or send me an email. And now it's time for questions.