A little about myself and the division, which is where I am now. Since I joined about one and a half years ago, I have worked first on text analytics and then on data anonymisation; we are building up another capability in data anonymisation to support data sharing. My toolkit is Python, Spark and a bit of Scala. So that is all about me.

Now, the context. The problem we worked on with HDB starts from the fact that agencies receive a lot of feedback from the public. With the many agencies and services we have, there is a huge volume of correspondence, and when there is an issue, you will see it there. So we started this project to show what we can do with this kind of data. There is a lot of it, and we wanted to see how we could use it to help the agencies, to surface what is happening on the ground. The data I will be talking about today is emails, and these emails came from HDB's sales channel. For those of you who are not familiar with HDB, you can go to their website to
get some information. So, for some background for the audience: HDB is the agency that provides public housing for Singaporeans, and the majority of Singaporeans live in HDB flats. Flats are bought through BTO launches and through the resale market, which PRs can also buy into. When buyers write in to HDB about their purchase, those emails go into the sales channel, and that is the data for this project.

So, what I received from HDB was raw emails from the CRM system. If you have worked with emails before, you will be familiar with some of the problems we face with them, and I will go through that later. Other than that, they also sent us some basic customer demographics as well as their internal organisation structure, which helped us later when we did the analysis. The emails they gave us came to about 270,000 over 3.5 years, starting from 2012; in recent years they have been receiving on the order of 100,000 emails a year.

So, the obvious problem is that there are far too many emails; you cannot possibly expect people to be looking through all of them. So it came to us to see whether there is a more automated solution.

Now I will go through the analysis methods. For those of us familiar with NLP, the obvious approach to discover topics within emails is topic modelling; it is a very standard approach for this kind of problem. So let me go through what topic modelling is about. It is an unsupervised topic discovery process. It works by looking at word co-occurrence, and the hypothesis is that words that relate to the same topic tend to appear together.
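As a toy illustration of that co-occurrence hypothesis (the documents here are made up, just to show the mechanics), a few lines of Python:

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): two underlying themes, "keys" and "resale".
docs = [
    "collect keys early appointment",
    "defer keys collection appointment",
    "resale flat valuation price",
    "resale flat price negotiation",
]

# Count how often each unordered word pair appears in the same document.
pair_counts = Counter()
for doc in docs:
    words = sorted(set(doc.split()))
    pair_counts.update(combinations(words, 2))

# Words from the same theme co-occur; words from different themes do not.
print(pair_counts[("appointment", "keys")])   # 2
print(pair_counts[("flat", "resale")])        # 2
print(pair_counts[("keys", "resale")])        # 0
```

A topic model exploits exactly this kind of signal, at scale, to group words into topics.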
And once topics are discovered, you can assign topics to each document based on the occurrence of the words within the document itself. So this process is completely data-driven; it does not depend on human input. The benefit of this is, one, it is unbiased: it does not depend on what I know or what the officers know. At the same time, it is also able to reveal unknown topics, which is what we want: to find out whether there is anything happening that should be acted on but is not really known at the moment.

This slide is just a bunch of jargon. Basically, topic modelling is a Bayesian model with latent classes, meaning there are hidden variables within the document generation process. These hidden variables are the topic distribution within documents as well as the word distribution within topics, and the model goes through an expectation-maximisation style inference process to discover them. But you do not really have to understand what is happening underneath, because there are well-implemented algorithms available in Python and Spark that can readily be used for this kind of analysis.

So, as output, for each document you get a distribution across topics: an n-by-k matrix, where n is the number of documents and k is the number of topics that you set. Additionally, there is another output, the topic-word distribution, which is the probability distribution over the words for each topic; its dimension is k by v, where v is the size of the vocabulary.

The steps to do topic modelling just follow from this, and these are some of the tools that you can use for the steps on the left. In Python, the first step is to read in the raw text. Then you do some text cleaning, and most of the time what you will be using is regular expressions, via the `re` module.
So, you are just doing substitutions to do the cleaning. Then, after you have cleaned the text, you split the text into tokens. What I mean by tokens is, you can roughly think of them as the individual words. An email comes in as one huge chunk of string, and you want to break it down into tokens before you can proceed to the next step, which is to create a document-term matrix. To briefly give an idea, a document-term matrix is just a matrix where every row is a document; if you have n documents, it has n rows. The columns are then the word counts over all the words in the vocabulary: each column is an individual word. So you can see the document-term matrix is actually a very, very large matrix, but in essence it just condenses the documents into an easily represented form. You can also see that going through this step throws away a lot of information within documents, in particular the order of the words. But you will discover that the order of words does not really matter in terms of the topics being discussed, because what matters is that words occur together within a document: words that belong to the same topic tend to occur together.

After you create the document-term matrix, you remove stop words, because stop words are very common and usually do not carry much meaning. What I mean by stop words are very common words like "the", "a", "is". I also trim the vocabulary, which means I remove words that are rare. After that, the matrix is fed into the LDA model, which is the standard topic model implemented in gensim. As shown here, this process uses the gensim topic model; it is just one of many.
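The tokenise, remove-stop-words, document-term-matrix steps above can be sketched with just the standard library (in practice gensim or scikit-learn does this for you; the stop-word list here is a tiny illustrative one):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to", "and", "of", "for"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [
    "Request to defer the collection of keys",
    "Appeal for early collection of keys",
]

# Vocabulary: every distinct remaining token, in a fixed order.
counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per vocab word.
dtm = [[c[w] for w in vocab] for c in counts]

print(vocab)  # ['appeal', 'collection', 'defer', 'early', 'keys', 'request']
print(dtm)    # [[0, 1, 1, 0, 1, 1], [1, 1, 0, 1, 1, 0]]
```

Note how word order is discarded: only the counts survive, which is exactly the bag-of-words assumption described above.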
So, you create a dictionary, trim the vocabulary, and then after that you run the LDA model. These are the steps used in gensim. It looks simple, but actually there is a lot of complexity inside each of these steps, the LDA model most of all: there are a lot of parameters that you need to control, and to do that well you need to be familiar with what is being done. What I did was just look at online resources to learn, because when I started I did not actually have any knowledge of text analytics. It is just something you have to pick up on the job; I do not think there is any formal educational training for this.

So, the problems faced. There were two main problems that I faced with the earlier steps. The first is very common: text processing. Those of us in the field will be familiar with the saying that data cleaning forms 60% of your work, and I will go through briefly what the problems with the text were and the steps I went through to clean them up. The other problem is model interpretation, which I will also go through briefly.

The first problem with the text is the emails themselves. Because they came from the CRM system, they contain a lot of structure: for example the header, the footer, the signature text. And this structure is not consistent. Some come from Yahoo, some from Gmail, some from Microsoft Exchange, so it is not easy to parse the emails; there is no obvious tool out there that can handle this job robustly. But at the same time, most of this structure is not relevant for the analysis, so I had to remove it in order for my model to do the analysis properly. And this is just an illustration of what you see: a raw email looks like this.
You can see there is a lot of structure around the footer and header. The whole email is long, but the body, which is just the blue part I highlighted, is maybe 10% of the length of the email. The footer, the signature and everything else make up the bulk of the email, but they are not relevant for the analysis. So you need to clean all this away in an automated, or at least semi-automated, manner: across over 100,000 emails you need a scalable solution. The obvious observation, if you look at the emails, is that headers contain very regular patterns: lines start with "To:", start with "From:", or for the signature, start with "Visit our website" and so on. So the obvious approach is to split the emails into lines and then try to remove these lines with regular expressions. Basically, I split on newlines, and then I have a bunch of regexes: starts with "To:", starts with "From:", and so on. Actually it is a huge bunch of regexes, which I accumulated as I read through the emails. That is the drawback of this approach: I need to go through the emails to recognise the patterns and then encode them as regexes to do the filtering. But the benefit is that it is very precise: you know exactly what patterns are matched, and you remove exactly those lines.

Other than headers and footers, the tricky thing for me was the signatures. In a text signature, people write their name, their title and so on, and these do not contain particularly regular patterns that I can clean off with regexes. I cannot use regexes to remove names, because names can also appear in the body: if I removed everything containing the officers' names, I might clean off a significant chunk of the body. How I looked at this was to recognise that signature lines are usually quite short. If you look at the chart on the right side, it is a bit small, so let me walk through it.
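The split-into-lines-and-filter approach can be sketched like this; the patterns below are a small illustrative subset (the real list, accumulated by reading the emails, was much larger), and the sample email is made up:

```python
import re

# A few illustrative header/footer patterns (the real list was much longer).
LINE_PATTERNS = [
    re.compile(r"^(to|from|cc|sent|subject)\s*:", re.IGNORECASE),
    re.compile(r"^visit our website", re.IGNORECASE),
    re.compile(r"^-{2,}\s*original message\s*-{2,}", re.IGNORECASE),
]

def strip_structure(email_text):
    """Drop lines matching any known header/footer/signature pattern."""
    kept = []
    for line in email_text.splitlines():
        if any(p.match(line.strip()) for p in LINE_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)

raw = """From: resident@example.com
To: sales@hdb.example
Subject: key collection
Hi, can I collect my keys earlier?
Visit our website for more information."""

print(strip_structure(raw))
# Hi, can I collect my keys earlier?
```

The precision the talk mentions comes from the anchored `^` patterns: a line is only dropped when it starts with a known structural marker, so body text is untouched.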
What happened was, after I split into lines, I counted the character length and the word count of each line and plotted cumulative proportion charts: one chart is over the number of words per line, and the other is the same chart over the number of characters. What you see is that very short lines dominate: lines of about two words or fewer already make up around 75% of all the lines in the data, and even completely empty lines alone form a few percent of all the lines. So there are a lot of short lines, and my hypothesis was that they usually do not contain any meaningful content: if a line is that short, it probably just says "thank you", or it is part of the header. I used this as a second filter, keeping only body lines with meaningful content, which are much longer, more than 8 words or 14 characters.

Then lastly, the final step I did was this: I recognised that some lines within the signatures and footers are repeated frequently across the emails, for example the disclaimer text. There is the line that says this email is privileged and confidential, if you are not the intended recipient you should delete it immediately, and so on; these are repeated very frequently. On the other hand, meaningful email content is unlikely to be repeated exactly: text that people actually type is seldom repeated word for word, unless it is part of a template, and even then it is only repeated a few times within the data. With this in mind, I did a count of all the lines across the emails and then filtered away those lines that appear very often within the whole dataset. Here are some plots that I did.
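Both line filters, dropping very short lines and dropping lines repeated verbatim across the corpus, can be sketched as follows; the thresholds mirror the ones mentioned above but should be tuned on real data, and the sample emails are invented:

```python
from collections import Counter

def is_substantive(line, min_words=8, min_chars=14):
    """Heuristic: very short lines ('Thank you', header fragments) rarely carry content."""
    return len(line.split()) > min_words or len(line) > min_chars

def drop_repeated(emails, max_repeats=3):
    """Remove lines that recur verbatim across the corpus (disclaimers, footers)."""
    line_counts = Counter(line for email in emails for line in email)
    return [[l for l in email if line_counts[l] <= max_repeats] for email in emails]

disclaimer = "This email is privileged and confidential; if misdirected, delete it."
emails = [
    ["Can my key collection date be brought forward to next month, please?",
     "Thank you", disclaimer],
    ["Is it possible to defer my key collection to the end of the year?", disclaimer],
    ["What documents are needed when I come down to collect my keys?", disclaimer],
    ["Can I change my key collection appointment to a weekend slot instead?", disclaimer],
]

# Corpus-frequency filter first, then the short-line filter.
cleaned = [[l for l in email if is_substantive(l)]
           for email in drop_repeated(emails)]

print(cleaned[0])
# ['Can my key collection date be brought forward to next month, please?']
```

The two filters are complementary: the length filter catches sign-offs that appear only once, while the repetition count catches long boilerplate that the length filter would keep.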
Just to go through the steps: I counted the lines across the emails with a counter, and then I counted the counts to get this plot. You can read it like this: most lines occur only once or a few times, and there are almost a million of them; this is where the meaningful content mostly sits. The lines that are repeated very often are, most of the time, signature tags, disclaimers and the like, and I used a cutoff on this count to remove lines that occur more than a certain number of times. So that is how I filtered out the boilerplate.

Next is the outputs. As I showed just now, the output from a topic model is two large matrices, and it is very hard to interpret them without tools to visualise them. This matters because the topics that come out of the model need to be labelled for the subsequent analysis to be useful, and the topic labelling has to be done by the officers on the ground. I am not familiar with the issues faced by HDB, so it is better for the officers handling these cases to do the labelling. That means I need to be able to hand the output to them in a form they can interpret; I cannot just hand over these two matrices and say, go and see what you can do with them. It is not going to work. So what I did was a visualisation. I did not build this visualisation from scratch; it is an open-source tool called pyLDAvis. What it does is visualise the topics in a two-dimensional layout: these are the topics, the size of each bubble is the prevalence of that topic in the corpus, and on the right it lists the top words for each topic, ranked. So let me just quickly show you how the output looks.
Something like this. You can clearly see, for each topic, what the words are, and there is a slider that lets you re-rank the words in different ways. So I handed this to the HDB officers, and they went through every topic, dragging the slider around to see what the words said about each topic, and then they labelled the topics. They came back to me with the labels, and from there I went on to do the further analysis.

Some of the topics that appear are junk topics, but you can still tell what they are. For example, what kind of text do you think this topic came from: confidential, privilege, disclose, disseminate? I think it is obvious it came from some form of disclaimer text, where it says this email is private and privileged, if you are not the intended recipient please delete it immediately, do not disseminate. You can see how the keywords are all there. And this actually helps, because all these boilerplate words clump together into one topic of their own, instead of being smeared across the other topics.

Next is the outcomes, after the topics were labelled. The main thing you can see is that there is a cluster of four topics all related to key collection: people appealing for early key collection, asking for waivers, housing policy discussions, discussions about deferment. When you put them together and look at their proportion across all the emails, you can see that actually more than 35% of the emails relate to key collection. Before this, it was obvious to them that
emails appealing for priority allocation were a major part of the emails they receive, but they did not realise that emails about key collection, requesting early key collection, asking for the key collection date, asking to defer key collection, took up so much in total. And if you think about it, the things that really need human input are requests like priority allocation or waivers of policy; something as simple as sharing your key collection date should be something that can be automated easily with a website or something. So that is something they are now doing, which I will come to later.

Another finding, which was not expected at all, was something like this: we found that requests for early key collection and for deferred key collection were related to age. The two rows at the top are email requests to delay key collection, and the two rows at the bottom are requests for early key collection; the columns are age groups, and the proportions are normalised along each row. You can see that requests to defer key collection have high proportions among the older age groups, while, on the other hand, requests for early key collection come more from younger people. We went back to the officers with this; they found it interesting, and they proposed a reason why this could be the case. Older people usually have an existing flat that they are still trying to dispose of, which is why they defer key collection: once you collect the keys to the new flat, you have to dispose of the old one within a given time. Younger people, meanwhile, are obviously getting their first home and are very excited, so they want their keys earlier to get the place, or maybe to renovate earlier.

So what came out of this project was that, because of the large number of requests about key collection, HDB actually started efforts to introduce
flexibility in the key collection schedule. I think they are trying to set up a website for people to do self-service adjustments to their key collection schedule. The next finding was that this project gave them a quantifiable, data-supported overview of the thousands of emails that they receive from the public. This is useful especially for management, because management do not look at the emails that the officers look at, and previously the officers could only communicate anecdotally to management what they found difficult. With this, management were able to see clearly the distribution of the requests from the public, which gave them a better idea of what is happening at the ground level. It also helped the officers sense-check what the major and minor issues raised by their customers are.

In conclusion: topic modelling, despite its simplicity, is surprisingly useful. I mean, look at the steps. You are not doing anything complex, just counting words, putting them into a matrix, feeding that into the algorithm, and it spits out output for you to look at. OK, it does use some form of machine learning, but it is not deep learning, and it is not the kind of language parsing that is more common in many NLP projects. Yet even with such a simple tool, we were able to produce high impact with this project. Going forward, we are planning to scale this up into a self-help, self-service tool for government agencies. We are still in the process of creating this tool, and we hope that in the future more agencies can benefit from this kind of analysis. So that is all I have for my sharing; I am happy to take any questions.

Q: Just wondering, for conversational threads, emails will have reply after reply. Did you deal with that at all? If not, wouldn't your overall sample be biased, because you would see repetitions?

A: That is a very good point. I did not, to be honest. It is very
hard to parse conversation threads because, as I mentioned earlier, the structure is very inconsistent, so I included all the quoted repetitions in the model. Actually, in some sense it is useful: for short replies, you know, "yes, thank you", on their own you do not get any context of what the conversation was about, and the quoted conversation at the bottom kind of helps place them in a topic in a more accurate manner. So although I agree that content gets repeated, I would say it was tolerable.

Q: Hi, sorry, I have two questions. Firstly, LDA is unsupervised, so how much can you sanity-check the result? I just want to know how you actually chose the number of groups, because there is no way to know what the right number is. And secondly, is there a reason you used LDA when there are other methods, like doc2vec or something like that? With more topics, how do you know which is best?

A: Like I said, there is no hard rule for this. What I did was train models over a range of topic numbers, from 10 up to 35, and then compare them, and how I settled on the final number was by using the visualisation tool. Looking at the visualisation, you can see how the topics spread across the two dimensions they are projected onto, and I used that as a sense check that the topics are distinct and hopefully more meaningful. If the topics are all clumped at the centre with one outlier, it means there is one very distinct topic and everything else is mush. In this case, after eyeballing the models, this number of topics seemed to look the best. So it is just based on feel; there is no hard rule. That was my approach. As to why I used LDA instead of doc2vec, it is just based on what was more approachable for me. I am not familiar with doc2vec at the moment, whereas LDA was easy to use and there were a lot of references for me to look at and see how
models can be adjusted and interpreted by others. So that is why.

Q: Hi, I just want to raise this: how do you ensure that the data science we do is not discriminatory? My point is, in the private sector data science is typically used for optimising products, so the stakes are lower, but since this is government, there are serious implications in how the government responds to citizens. For example, something you mentioned, like cutting out the shorter email lines: is there a worry that this might disproportionately impact emails from people who are less proficient in writing, whose emails are very short? That is a very serious consideration. How do you balance that?

A: I think the point is that the output of the analysis is not a machine-to-machine pipeline; it comes to a human. I had to do a presentation to HDB, and after that they take action from it. So there is always going to be human input between the analysis and the action, as a sense check on whether it is reasonable to act on. In some sense, we hope that the human input will help handle this.

Moderator: I just want to add a clarification to that: every email is still being read and responded to individually by HDB officers; this analysis sits on top of that, it does not replace the replies.

Q: Did you use TF-IDF to weight the words, or just raw counts? And may I know what would happen if you used TF-IDF? How would TF-IDF change the results?

A: OK, so, on TF-IDF. TF-IDF essentially down-weights words that occur across many documents, so that the words that are distinctive to a document get more weight.
For LDA, though, I did not use TF-IDF weighting, because LDA is a generative model over word counts: it models how many times each word occurs in a document, so it expects raw counts as input, and transforming them with TF-IDF would break that assumption. The overly common words are instead handled by removing stop words and trimming the vocabulary.

Q: What about on HDB's side: will they use this to automate the replies?

A: I think that is a business decision on their side. Even if HDB wanted to automate the handling of these requests, they should not force automation on people; what they prefer is to give people more options, such as self-service, and this analysis makes it clearer to them which requests could be served that way.

Q: The officers actually already tag the emails with categories, so there are many existing categories. How did you use those?

A: When I looked at the existing categories, a sizeable share of the emails ended up bucketed as "others": one, because the category list does not cover everything that comes in, and two, because some emails touch on many topics, so the officers do not know how to classify them and put them under "others". What I did was compare the topics I discovered against the categories that had been manually assigned, to see how they line up. So, it kind of helped me verify that the topics I discovered make sense.
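Coming back to the TF-IDF question, here is a minimal sketch of one common smoothed TF-IDF formula (implementations differ slightly in the exact smoothing; the toy documents are made up):

```python
import math
from collections import Counter

docs = [
    ["keys", "collection", "defer"],
    ["keys", "collection", "early"],
    ["resale", "valuation"],
]

def tf_idf(term, doc, docs):
    """TF-IDF: term frequency times inverse document frequency (log-smoothed)."""
    tf = Counter(doc)[term] / len(doc)          # frequency within this document
    df = sum(term in d for d in docs)           # how many documents contain it
    idf = math.log(len(docs) / (1 + df)) + 1    # discount corpus-wide words
    return tf * idf

# "keys" appears in most documents, so it is down-weighted relative to
# "defer", which is distinctive to the first document.
print(tf_idf("keys", docs[0], docs) < tf_idf("defer", docs[0], docs))  # True
```

This also illustrates why it does not fit LDA directly: the weighted values are no longer integer counts, which is what the generative model assumes.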
So, that is all I would like to share with you. I am very grateful for this opportunity to share what we are doing in the government with data science. Thank you very much.