I'll share a bit more about the club, SMU BIA. So SMU BIA is a club at SMU; we're a student body for analytics at SMU. We want to make sure that everyone at SMU has the opportunity to pick up analytics. We also want to connect with people like yourselves, professionals, to grow our network, and from that we can better share with and help our students understand how the real world actually uses analytics.

A bit more about us: at the moment we have about 700 [unclear], and 200 members at SMU across the university, mostly undergraduates. We have 32 dedicated Data Associates who meet regularly to develop their analytics skills, mainly in R and Tableau, and at the same time they also run some projects of their own. We also have industry partners, and so far we have run roughly 10 industry projects, with 5 industry projects currently ongoing, and 6 workshops. The workshops we have run cover Python, NumPy and pandas, NLP, and Tableau — we have a partnership with Tableau — and we have run workshops teaching neural networks, working with [unclear] as well.

[Garbled audio] We also ran a two-day workshop on neural networks with Python. We want to keep running workshops that teach members the right, relevant skills, and we want to explore more ways of doing that. This event was actually a really good first step, with the networking session and industry talk. At the events we have held, and the alumni sharing sessions we have had, DBS came down and talked to us about how analytics is used in the banking sector. We have had Expedia share how they use analytics and all their search engine marketing. We have had [unclear] sharing more about the programs and internships they were offering, as well as IMDA sharing about what they do in the government sector: the projects, the opportunities, and the scholarships they were providing. So all these are avenues through which SMU BIA aims to engage professionals like you. So far we have these partners, and if you are interested in working with SMU BIA, you can get in touch with us.

This is a bit more in depth about our talent pool. We have a large group of data enthusiasts, the 200 members that we have; they are actively involved and interested in joining analytics events like this one. And the DAs that were spoken about earlier are the 32 dedicated members who meet on a very regular basis to develop their skills in a co-learning setting. This year was actually our first structured DA program, and moving forward they will be the senior DAs: the trained people with advanced Python and Tableau competencies, who will help mentor the next generation of DAs. Our email list reaches over 9,000 students. Our Instagram is quite new; it has about 100 followers and it's just getting started.
On the website you can find us: you can go to SMUBIA.org to find out more about the events that we have held. We are also developing our Medium channel, where we post more about technical competencies and what student life is like for our BIA members. So if you wish to work with us, you can take a screenshot of this slide — those are all our contact channels. Before we round up: SMU BIA is actually an IIE hub, where IIE stands for the Institute of Innovation and Entrepreneurship. They are holding an event on the 2nd of April, 6.30 to 8.30pm, which is also open to the public. If you would like to sign up, you can scan the QR code here, or you can contact us afterwards. At this event they will be inviting Melvin Ang, the founder and executive chairman of mm2 Asia. He will be sharing more about this Singapore-based film production and distribution company with a global footprint, most notably the producer of the hit movie series Ah Boys to Men. So if you like that series and want to find out more, you can go to this AMA event, where you can ask him anything. Before I close off, we will have Gabriel come up to tell you what's next.

Hi, I'm Gabriel, here at SMU. Thanks for coming out today. [Garbled audio] After the talk we're giving about 10 to 15 minutes for questions. So if you're interested in SMU BIA or hosting, we'd be glad to have you. We're all here in the School of Accountancy, where I'm appointed, so I just need to give a quick bit of information about some of our programs here at the School of Accountancy at SMU. I would have had our director do it, but he's unfortunately traveling. Okay, so I'll tell you about our master's program, which actually covers analytics as well as accounting. We have this hybrid program where you learn both on the accounting side and on the analytics side.
As far as who we're targeting: generally more business-related people. So if you're somebody in business and you want to break into analytics, or you have an interest in, say, a business analyst role but you want to get deeper into the data, that's the sort of profile we're targeting here. We offer part-time and full-time versions of the program, and the intake is in August. As far as the curriculum goes, we have a mix: the first column is your core accounting classes — a couple on financial accounting, quite a bit of other material, one on information systems. The middle column is more your core technologies: R and Python, data visualization. And then on the right side is the analytics content, where we actually apply all of that. I've actually taught a couple of the courses here: forecasting, and forensic analytics — that class covers stuff like what I'll talk about in today's talk, and we cover how to actually do all of it in R.

We also have some scholarships. We have this SMU MSA scholarship, which offers partial funding of up to $10,000, and also, thanks to UOB, we have a full UOB-SMU scholarship, where they can cover the entire program, offer an internship, and potentially full-time employment. So it's very nice. If you want to learn more, you can go to this link here, or contact any of our organizers — you have A-Gen and Mabel — so if you have any questions, ask them. She has some brochures you can pick up, and you can talk to her about the program as well.

Now, the actual content for today. My talk is going to be corporate fraud, LDA, and econometrics. LDA is a machine learning technique — the only machine learning technique I'll talk about in this talk, though of course there are others that are relevant. I'm going to go through these three topics in order. We need some grounding about what corporate fraud is before we can talk about detecting it. Most of you are coming more from the IT side, so you've heard of fraud
— you kind of know what it is, but maybe not in depth. If we want to figure out how to detect who's committing fraud, we need to know what they're actually doing; then we'll get into the technical part.

So our main problem is going to be this: how can we detect if a firm is currently involved in a major instance of misreporting? When I say detect, that means we're going to have some sort of classification problem. When I say currently, it's also a prediction problem: we want to know, today, who's committing fraud. These aren't companies that have been found out already; these are companies that are actively doing fraud. Nobody knows who they are, but we want to figure out who they are so that we can stop them and report it. And this misreporting is what we're going to talk about next.

Then there's the accounting side. We have a mixture of some analytics content — the classification and prediction — and then the accounting side, the business side, and we're going to fuse those two together. To actually do this we're going to need a bit of business insight, some economic theory, some psychology theory as well — actually getting into the mind of what management is doing — and then also our more core techniques: statistics, machine learning, econometrics.

As for why we care about this: well, fraud costs a lot of money. Just for the US, just for the 10 most expensive frauds, just for the shareholders, they lost 13 billion dollars — and that's only one of the injured parties. We're ignoring a lot of the costs; we're ignoring, say, the cost to the country, the GDP impact. Just one fraud alone, Enron, was estimated to have about a 35 billion US dollar impact on US GDP for one year, and that's ignoring their contribution to future years. We're also ignoring societal costs: a lot of people lose their jobs, a lot of people lose confidence in the economy. From a government perspective, those are actually quite crucial.
There are also some negative externalities going forward: when you have a fraud, a lot of times you'll get some regulation that comes out to try and prevent it happening again, and that's an extra cost every other company has to bear to comply. So it drags on the economy a little as well. And of course that 13 billion is just a raw total. If you think about it, if we can just catch one more of these guys, we're saving over a billion dollars per fraudster. If we can do this at scale, we can save billions of dollars for the economy. That's actually quite important.

So what is misreporting? A simple definition. If there's just one thing you take away from the accounting side of today's talk, it's this: misreporting is when you have an error that affects the firm's accounting statements or their disclosures, and it was done seemingly intentionally — I say "seemingly" because intent is very hard to prove — by either management or some other employee at the firm. The key about fraud is that they did it on purpose. We're not talking about cases where a company did something wrong and it caused a lot of problems, but it was a pure accident. That actually happens a lot, but it's not like they were doing it on purpose; you can blame them, but you can't really blame them quite so much, and it also tends not to have such far-reaching consequences. But when you have, say, a CEO or a CFO or some other managers who are actively trying to circumvent the controls in the company, to make everyone think the company is fine, and then it just collapses — that's when it really counts: you take a company that's worth tens of billions of dollars and it goes under.

As far as how this is typically done: your traditional accounting fraud is a company that's just not doing well, and they say, well, let's find some way to make it look like we are; they find some scheme and cover it up. So take Wells Fargo: from 2011 to at least 2018,
they were just duplicating customers or making up fake ones. That's an easy way to, say, inflate your numbers, and then of course tell everybody investing in the company: look, everything's great. There are plenty of other ways to do it. Dell, for a long time, did the opposite: they were doing so well that they didn't want to tell investors just how well, and so instead they hid a lot of the payments they were getting from Intel. Intel's payments actually made up over 76% of their operating income in one quarter, but they just didn't mention this. At some point the arrangement stopped working out, they tried to cover it up, but eventually the money dried up, and then bad things happened. There was also options backdating: Apple just didn't tell people exactly what their expenses were — that was a lot of tech companies in the mid-2000s. You could have related-party transactions: China North East Petroleum made 176 transactions between the company and various family members, giving money and loans to family members without telling investors what the money was for or whether it was recoverable. Or you could have some perhaps more interesting ones. CVS is sort of like your typical drugstore — in Singapore, think Watsons. They improperly accounted for stuffed animals: their accounting statements were off by over $20 million just because they didn't properly account for the stuffed animals. Sounds a little funny, but it's actually a serious issue: it means there's $20 million they say they have that they don't. Then there are some really weird ones, like this "Countryland Resorts" or "Wellness Resorts" [name unclear] — which, despite the name, was not a resort company but a mining company, first of all. They had this warehouse, and they said: okay, in this warehouse we have a bunch of gold, with a tarp over it. The auditor said: yeah, seems fine — there's a warehouse, there's something there, gold, good. Then at some point the auditor decided to check what was under the tarp. It was dirt. They had no gold. As a mining company, that was a problem.
So in that case the company completely fabricated its entire existence. We have all sorts of crazy things happening, and we're trying to figure out what all of those are and who's doing any of them.

As far as where we actually get data on this: the US government conveniently provides this data publicly. It's not so usable out of the box, but it's there. We have the US SEC AAERs, Accounting and Auditing Enforcement Releases. That's going to be one of the main sources, because essentially any time there's a really big fraud, the US government publicly shames the company with this type of document. They say: hey look, this company did all these bad things, and we're just going to tell everybody about it. Sometimes companies reveal it of their own accord by filing an amended annual report — a 10-K/A filing; the slash-A means amended. They're saying: there's something wrong in the previous one, use this one instead. The US government also has another channel called 13(b) actions. And there are a couple of other places: notes inside annual reports, sometimes trying to hide it, or press releases. There are a lot of ways you can get this data, and they're all subtly different.

As far as where that leaves us: one, we know there are a lot of recipes for fraud, which is going to be kind of painful. On the analytics side, if we want to detect fraud, we have to detect all of those things I mentioned and more. All of these are frauds, all of these are problems, but they're all quite different. Also, none of them happens very frequently. We can't break these up into individual types of fraud, even though that might be easier, because we can't get enough data per type to actually train a classifier. And as I said, there's a bunch of different places to get this data, but they're all subtly different, so we'll have to be very cognizant of that, and perhaps just use a bunch of them. So instead of predicting one specific type, we predict misreporting pretty broadly. We also need to be aware that we're getting into a hard problem: we're detecting something quite varied, and we don't have well-defined data. It's going to be a tough thing to approach.

So let's get into the more analytics side. The main question is just: how can we detect if a firm is currently involved in a major instance of misreporting? The way we typically did this back in the 1990s was pretty naive. We decided: well, let's use some financial ratios; if companies are misreporting, it will probably show up there. That worked for a bit and then stopped working. There's a good reason for this: essentially, managers realized that if you go above certain thresholds on certain ratios, your auditors start asking questions, so they stay just under them. After that, the accounting research said: why don't we look at how they write their annual report? Maybe that's helpful. Let's see if it's overly positive or overly negative, or maybe the report is overly long or overly short; maybe we can use that. That actually added a bit of prediction power. But the new model I'll tell you about today says: why don't we just look at what they actually talk about? The simple idea is, if they're, say, manipulating their inventory, they probably don't want to talk about the inventory. It's pretty straightforward, but that's the whole intuition behind it: if they're manipulating something, they're going to dance around it. As far as all the details: if you'd really like to see them, the paper is publicly available; you can grab a copy after the talk and see everything there. It's 80-some pages long, so it's quite thorough.

As far as the issues we have to address: I said this is a hard problem, and there are a lot of things we'll have to be very careful about. First, I'll use some econometric techniques to take care of part of it. Second, there's smart feature design: I'm not going to hand-craft everything; I'll pick reasonable features and let machine learning do part of the work. That tends to work better than one very complicated hand-built model. But if you want to deploy something and explain how it works, we have a problem: managers typically don't want to hear about sophisticated techniques. "Here's the model, the model has 50 variables" — they don't want to see that. They just want a simple, intuitive way to see it, and they also want some other results that show confidence that the measure works. So we'll see how we can do that. Then we need to deal with predictive modeling. Predictive modeling in this case is going to be a bit difficult: as I said, we have very sparse data, so we have to be cognizant of the statistics and econometrics side. We use a window design to take care of that for back-testing, and I'll show you how that works. And then the infrequency of the events is going to be difficult to deal with too, so we'll have to handle that as well. That's the roadmap for what we're going to talk about.

Just a little bit about how the model actually performs. Here in purple, you can see the model that we proposed, on a very simple measure: just take the top 5% of companies flagged by the algorithm and see how many frauds are in there. For the AAERs — the ones where the US government says: this is so big, we want to publicly shame the company — our model is 59% better than anything else out there. So just adding this spin of what companies are talking about actually adds a lot of prediction ability. In terms of raw percentage, it adds about 8 percentage points to the overall prediction.
In terms of some other things, like the 13(b)s, which are a bit more varied, we actually have an even higher gap, and we can pick up stuff the other measures may not. The other thing I want to mention, though, is right here: this AAER first-year column means the very first year a fraud started — that's how many we can pick up. In the very first year, we're going to catch about 15% of all frauds. That doesn't sound very great, but frauds are really hard to detect, especially at the beginning, when they haven't been going on for long. Frauds become easier to figure out over time, because things just start to come out, and the gap between the start of a fraud and when it gets caught averages something like 6 years. So the fact that we catch 15% of these in year 1 is actually really important, especially from a dollars perspective. If you, say, just plug this into your reporting system, you can detect 15% of them right off the bat; that's a chunk of companies committing fraud that you never have to deal with, say, 6 years from now. That saves a lot of time and a lot of money.

As for what goes into the models: these are the past models first, just a few simple measures for each. The old models used financials — things like how big the company is, the log of the amount of assets they have; the percentage change in cash sales, so how many of their sales are in cash, a typical thing to manipulate; or whether they're having a merger or not. A lot of times these mergers aren't necessarily for business purposes but for hiding things, right? You have a complex transaction and you sort of toss everything in. That's what happened in the Olympus fraud, for instance: Olympus used mergers and acquisitions to hide everything they were doing for like 15, 20 years. The whole theory behind that class of model was just economic, right?
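The top-5% evaluation just described can be sketched in a few lines. This is a minimal illustration with made-up scores and labels, not the paper's actual evaluation code:

```python
import numpy as np

def capture_rate(scores, labels, pct=0.05):
    """Share of all known frauds that land in the top `pct` of model scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    k = max(1, int(len(scores) * pct))       # number of firms we flag
    top = np.argsort(scores)[::-1][:k]       # indices of the highest-scoring firms
    return labels[top].sum() / labels.sum()  # frauds flagged / frauds overall

# Toy data: 100 firms, 4 known frauds, 3 of which the model scores highly.
scores = np.linspace(0.0, 0.9, 100)   # baseline scores for the 100 firms
labels = np.zeros(100, dtype=int)
labels[:4] = 1                        # the four fraud firms
scores[:3] = [0.99, 0.98, 0.97]       # three frauds score near the top
scores[3] = 0.10                      # one fraud the model misses
print(capture_rate(scores, labels))   # 0.75: three of four frauds in the top 5%
```

The same function works for comparing models: score the same firm-years with each model and compare capture rates at the same cutoff.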
We just say: if a company is committing fraud, it's going to end up showing up in their accounting statements, so try to use that. The newer models use the style of the documents. Those include things like the length or how repetitive the document is, word choice, sentiment, grammar, sentence structure, those sorts of things. The theory came from communications, and says some unintentional bias will manifest when you're trying to hide something fraudulent. And some of them were actually quite ad hoc: they just said, well, maybe this works, and threw in another 86 variables. We have 17 financial variables and 20 style variables in the model already.

The model that we're going to use looks at the content of these documents. We said financials are too easy to cover up, and the style of the document doesn't really capture that much: it captures some summary statistics, but you're not getting at the actual substance. So instead, we're going to take those variables from before and use them as a baseline, but then we're going to quantify how much each of these documents talks about different things. Now, one problem with these documents is that they run from about 20 pages on the low end up to 300-some pages on the high end, so computationally they're quite painful to run. We'll end up training on five years at a time. This is in part because I didn't have enough RAM to run the whole thing, but it also works out quite well, because we're going to be back-testing, and we don't want to use any data that might taint our back-test, right? We want to be very careful about how we design this. We're just going to use a five-year window, and we'll run our machine learning algorithm within those five years so that there's no influence from outside years.
Then we're using 31 topics per year. I'm not going to talk about that design decision — it's covered in the web appendix. As far as why we're actually using this measure: the idea is that we want to think like these managers, right? If you're a manager and you're committing fraud, what would you do? Say your fraud is in inventory: will you talk about inventory? My guess is no. If I were committing fraud in inventory, I probably wouldn't want to talk about inventory; I'd dance around it. And so this is a measure that will naturally pick those guys up. That's the whole idea here.

Now, some examples of what these topics look like, straight from the paper. There's one topic about aerospace: titanium, aerospace and such. There's one banking and insurance topic; there are general business activities, internet terms grouped together in that topic. Just to show you what this sort of algorithm pulls out: these look reasonable; these words sort of belong together in general. As far as how we do this, it's the LDA method I mentioned up front. That stands for Latent Dirichlet Allocation. You can pretty much find this in any programming language you use; no matter what you use, there's an implementation these days. You probably also interacted with it at some point today. Google Search: one small component of that algorithm looks at articles and suggests articles to you using LDA. You go to Twitter, it suggests users to follow; they're actually running LDA on the tweets and comparing that with the tweets that you like. You interact with this type of algorithm all the time. If you want to implement this in Python, I would recommend gensim; it's very convenient and also multi-threaded nowadays. In R, you can use stm. But unfortunately, we did not have these packages back when I ran this — this paper has been in the works for a long time, since 2013.
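As a concrete illustration of fitting LDA on a toy corpus — using scikit-learn here purely to keep the sketch self-contained (the talk recommends gensim in Python or stm in R for real work); the documents and topic count below are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Four tiny "filings": two aerospace-flavoured, two banking-flavoured.
docs = [
    "titanium alloy aircraft aerospace engine",
    "aircraft engine aerospace titanium parts",
    "loan deposit bank insurance premium",
    "insurance bank premium loan policy",
]

X = CountVectorizer().fit_transform(docs)        # document-term count matrix
lda = LatentDirichletAllocation(n_components=2,  # "look for 2 topics"
                                random_state=0)
doc_topics = lda.fit_transform(X)                # rows: P(topic | document)

print(doc_topics.shape)                          # (4, 2): a topic mix per document
```

For real filings you would fit one model per five-year window, as described, and each firm-year's row of `doc_topics` becomes its topic-content features.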
So we actually used code straight from David Blei's website — David Blei created LDA. What LDA is going to do is just read all these documents. All you tell it is: look for 31 topics, or however many topics you think there are. It goes and reads everything, and then it reports back to you something quite intuitive.

As far as implementing this, there are some difficulties in this specific context. We have these annual reports: a big year-end document each company puts out, required for essentially any public company anywhere in the world. There's a lot of stuff in these documents. Unfortunately, businesses don't follow any consistent way of producing them. Some companies do fixed-width text files. Some companies do proper HTML documents, which are quite nice. Then some companies write it in Word and export the Word file as an HTML file. I don't know if you have ever worked with Word files exported to HTML, but they're horrible. Absolutely horrible. And unfortunately that covers most of these documents. Sometimes they also embed hexadecimal code for images and all sorts of other things you wouldn't expect to be there. The way to deal with that is mostly regular expressions — just tons of them. Actually, I tried using [unclear] — [garbled audio]. And then the other issue is that LDA will happily pick up industry language, and that's not what we want for our measure. If we have a cable company and they talk a lot about cable, we don't care, right?
It's a cable company; of course they talk about cable. We want to know if they're a cable company talking about the cable business more than other cable companies do. So we're actually going to industry-adjust these topic measures — normalize against the industry to get a sense of what's abnormal for that firm. It's not how much they talk about things, but how they talk about things relative to their industry peers that matters for our measure. That makes the measure cleaner, though it does make it a bit harder to interpret on its own.

The second question is validation — and it's fair to say there are more problems here, or more opportunities to get it right. The practitioners who would rely on this measure don't know what LDA is; they don't care about LDA. They want to know that it lines up with real business topics. Now, there are plenty of benchmark corpora in CS for validating topic models, but there isn't a good corpus for business topics. So we decided to build one. We built it with real people: we put together something structured as an odd-one-out task. We took words from one topic, added a word that didn't belong, and asked which one doesn't fit. [Garbled audio] We gave them these questions, 20 questions per person, for 100 people. The nice thing is, they were the right kind of people for the task.
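The industry adjustment described a moment ago — caring only about how much more a firm discusses a topic than its industry peers do — can be sketched as a simple de-meaning. The weights and industry codes below are made up, and the paper's exact normalization may differ; this just shows the idea of removing the industry baseline:

```python
import numpy as np

# Hypothetical topic weights: share of each filing devoted to a "cable" topic.
topic_weight = np.array([0.50, 0.45, 0.05, 0.10])
industry     = np.array(["cable", "cable", "bank", "bank"])

# Subtract each firm's industry average, so a cable company talking about
# cable exactly as much as its peers scores roughly zero.
adjusted = topic_weight.copy()
for ind in np.unique(industry):
    mask = industry == ind
    adjusted[mask] -= topic_weight[mask].mean()

print(adjusted)   # roughly [ 0.025 -0.025 -0.025  0.025]
```

After this step, a large positive value means "talks about this topic unusually much for its industry", which is the abnormal behaviour the measure is after.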
So we know whether they were paying attention: we included attention-check questions, easy ones that anyone paying attention would get right, and we drew the intruder words from the most distinct topics you could get, so there's a clear right answer. Now, there's a limit to how far human raters scale. The way we get around that is to replicate the human task with an algorithm — word2vec. We use word2vec embeddings trained on a very similar context, so it behaves more like the people we surveyed. If the algorithm agrees with the humans on the questions the humans answered, then we can trust word2vec on its own for the rest. So word2vec becomes the algorithm we rely on, once we've validated it against the human raters. And the nice thing about an algorithm is that we can ask it not just 20 questions: we can give it 10 million more. We ran that comparison, and it correlates well. And then we see that the topics really do hang together better than chance — there is something to these topic groupings, and they line up with business language. It's quite convincing. So that shows these topics are actually quite sensible.

The third question is the prediction model. We don't know who's committing fraud today; if we did, we wouldn't need the model. We're going to try to use past data to predict who's committing fraud today. So that's back-testing. There are a few problems with this. First, fraud actually changes over time.
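The odd-one-out check the algorithm replicates can be sketched with cosine similarity over word vectors. The tiny 3-dimensional vectors below are stand-ins for real word2vec embeddings (which a library like gensim would learn from the filings; gensim's `KeyedVectors.doesnt_match` does essentially this):

```python
import numpy as np

# Toy stand-in "embeddings"; real word2vec vectors have hundreds of dimensions.
vec = {
    "titanium":  np.array([0.90, 0.10, 0.00]),
    "aerospace": np.array([0.80, 0.20, 0.10]),
    "aircraft":  np.array([0.85, 0.15, 0.05]),
    "insurance": np.array([0.10, 0.90, 0.20]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def odd_one_out(words):
    """Return the word least similar, on average, to the others in the list."""
    avg = {w: np.mean([cosine(vec[w], vec[o]) for o in words if o != w])
           for w in words}
    return min(avg, key=avg.get)

print(odd_one_out(["titanium", "aerospace", "aircraft", "insurance"]))
# -> insurance
```

Because this is fully automatic, the same question can be asked millions of times, which is how the 20-question human benchmark scales up to 10 million algorithmic checks.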
What fraud looked like in earlier decades is quite different from what it looks like now; it changes over the years. So our model needs to change over time too; we'll build something quite flexible. Second, frauds aren't observed immediately. When we build the model, we need to know whether a firm committed fraud, and you don't know that in the year it happens. You might not know the year after, either. In a case like the one I mentioned, the fraud started in 2004 and was only caught in 2011 — that's 7 years. So we have to be careful. For the prediction, we use a rolling window: we take these 5 years of data and use them to train a model that predicts the following year. Then we roll forward: next we use years 2 through 6 to predict year 7, then 3 through 7 to predict year 8, and so on. Our data starts in 1994, so we use 1994 through 1998 to predict 1999, and continue up to predicting 2012, the most recent data we have. As frauds are revealed, we incorporate that. Of course, doing it this way means we end up with 14 models rather than one. So we'll talk a bit about how to compare models when we have that many, without cheating too much. Usually people build one model for one task, look at the t-statistics and whatever else they need, and interpret that. The problem in our setting is that we can't do that: we have 14 sets of estimates, so we'd have to look at 14 sets of t-statistics, or somehow aggregate 14 models. So I'll show a different approach to comparing the models.
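The rolling five-year window just described can be generated mechanically. A sketch, using the years and window length from the talk:

```python
def rolling_windows(first_year, last_year, train_len=5):
    """Yield (train_years, predict_year) pairs: train on `train_len`
    consecutive years, predict the next year, then roll forward one year."""
    out = []
    for predict in range(first_year + train_len, last_year + 1):
        out.append((list(range(predict - train_len, predict)), predict))
    return out

wins = rolling_windows(1994, 2012)
print(len(wins))      # 14 separate models
print(wins[0])        # ([1994, 1995, 1996, 1997, 1998], 1999)
print(wins[-1][1])    # 2012, the most recent prediction year
```

Each tuple defines one model: fit on the five training years, score the prediction year, and never let information from later years leak in.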
I'm going to show you a way to compare the models. There's actually a statistic we can use for exactly this — it goes all the way back to 1932, and it holds up very well. [inaudible] The test is built on a chi-square distribution. It has an R implementation, and it's easy enough to compute in Python as well. Once we have it, we can compare the different models we've fit: we want to see which model does better before settling on one. By itself that's not easy, but with this test you can take two models, compute the statistic, and check whether one AUC is significantly higher than the other. Once you have the statistic, you can trust the comparison instead of eyeballing the curves.
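The talk doesn't name the exact 1932 statistic, so as a purely illustrative stand-in, here is the chi-square machinery applied as a plain likelihood-ratio test between two nested fitted models, using SciPy's `chi2` distribution (the log-likelihood numbers are made up; note the paper compares models on AUC, which a likelihood-ratio test does not directly cover):

```python
from scipy.stats import chi2

def lr_test(loglik_restricted, loglik_full, df_diff):
    """Likelihood-ratio test: twice the log-likelihood gain of the
    fuller model, referred to a chi-square with df_diff degrees of
    freedom. A small p-value means the fuller model fits significantly
    better than the restricted one."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2.sf(stat, df_diff)

# Hypothetical log-likelihoods from two nested logits.
stat, p = lr_test(-520.0, -505.0, df_diff=4)
```

The pattern — compute a statistic, look up its tail probability against a chi-square — is the same regardless of which specific comparison test the paper actually uses.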
You can take that statistic and tackle this problem with it. The test is an established one — again, it comes out of a 1932 journal article. What it says is that you can compare the models: when you run the comparison, you get a p-value you can look up against a chi-square distribution. That's easy to work with, right? The question is whether the chi-square approximation stretches too far here. We actually checked that, and we found it never became too permissive, so we trust the comparison for our models. It works across the variants we tried, with the parameters and details spelled out in the paper. So you can rely on a fairly standard testing procedure. So we can say, just based on a chi-square statistics approach, that one model is indeed better than another. And that's summarized there. You can implement it — there are actually some ways to implement it in SciPy, so you can just use that. So our other issue is going to be observability, as I mentioned. We don't learn about frauds in the year they happen. So we're going to have to be very careful when we back-test. Say we're doing a back-test over 2005 to 2009: in 2009, we didn't yet know who had committed fraud, for the most part. So we actually have to censor our data so that we don't bias our algorithm. We actually have to train it while telling it some lies. We say that a company didn't have fraud because, at the time, you didn't know it had fraud. It actually does, right? But that's because of what we need to do today: we want to use data from, say, 2013 to 2018 to tell us who's committing fraud this year, right?
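The censoring the speaker describes — label a firm-year as fraud only if the fraud had already become public by the end of the training window — can be sketched like this (the field names are illustrative, not from the paper):

```python
def censor_labels(firm_years, train_end):
    """Label a firm-year as fraud (1) only if the fraud was publicly
    revealed by the end of the training window; otherwise label 0.
    firm_years: list of dicts with 'is_fraud' and 'revealed_year'."""
    labels = []
    for fy in firm_years:
        if fy["is_fraud"] and fy["revealed_year"] is not None \
                and fy["revealed_year"] <= train_end:
            labels.append(1)
        else:
            labels.append(0)  # unknown at the time => treated as clean
    return labels

data = [
    {"is_fraud": True,  "revealed_year": 2007},  # public by 2009
    {"is_fraud": True,  "revealed_year": 2011},  # not yet public in 2009
    {"is_fraud": False, "revealed_year": None},
]
labels = censor_labels(data, train_end=2009)
```

The second firm-year really is a fraud, but a model trained through 2009 is deliberately told it is not — mimicking what a regulator would actually have known at the time.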
Well, we don't know everyone who had a fraud in 2018, so we want to give the model the same sort of data that we actually have to work with. Otherwise, it's not going to work very well. So our solution is going to be censoring, and that ends up mimicking what we actually need to do today. The last issue is how infrequent frauds are, right? So we have, say, 38,000 firm-years — firm-years meaning years of data per firm. Out of those, probably 105 have fraud. That's under 0.3%. That's quite low, right? For typical logistic frameworks, we usually want something closer to 10%, so we have some issues. We're going to have to be really careful here, especially when we're doing five-year windows. There are some windows where we have maybe 30 or 40 frauds, right? That's really tough to justify. So there are some ways we can handle it. The first would be to just use very simple models. Unfortunately, we already know those do not work at all — that's basically the old financial models of the 1990s, and they just don't work. The second is to use a generic variable-identification strategy: some very deliberate way of removing variables. That's mainly what we're going to do. Or you can use automated methods: lasso, XGBoost. We do mention lasso in the paper; XGBoost works great, but it's not implemented in the paper. As far as how we do this: we start by tossing everything into the model, and then we use a QR decomposition. So for those of you who took linear algebra — it actually is quite relevant sometimes. You don't think you're going to use it, but on occasion it comes back, right? You can do this with NumPy quite easily. You do a QR decomposition, and using that, you can figure out the weight assigned to each factor for each feature within your data, right?
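The talk doesn't spell out the exact construction, but one assumed way to read a per-feature "weight" off a QR decomposition is the diagonal of R on the standardized design matrix: a near-zero diagonal entry means that column is nearly a linear combination of the columns before it (note this score is order-dependent):

```python
import numpy as np

def qr_independence_scores(X):
    """Score each feature by the magnitude of the corresponding diagonal
    entry of R in a QR decomposition of the standardized design matrix.
    Small score => column is nearly spanned by the earlier columns."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    _, R = np.linalg.qr(Xs)
    return np.abs(np.diag(R))

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Third column is almost an exact copy of the first.
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])
scores = qr_independence_scores(X)
```

Here the redundant third column gets a score close to zero, so it would be the one kicked out first under the screening rule described next.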
So the higher the weight on its own feature, the more independent that variable is from the rest. And then we're essentially going to kick out the variables that aren't independent enough, so we can at least get a convergent logit, right? At the beginning, we may have to kick some out just because we don't even have enough events — you can't have more variables than events. So we kick those out. After that, we do some other things. We use a Newton-Raphson solver for the logit, because that converges a bit better than some of the alternatives. You have to be very careful which optimizer you use for the logit: some of them simply will not converge even when the logit itself has a solution. Then we have to check for something called quasi-complete separation. It's something you've probably never heard of, but on occasion, if you run a logit and you get, say, a standard error of a million on a variable, it means that variable almost perfectly separates the zeros and ones. That's what quasi-complete separation is. So when you get results that look completely ridiculous, that's typically the cause. It's a case where the logit converged, but kind of didn't: technically it is convergent in some edge case, but the edge case is so far away from any reasonable model that there's no reason to use it. So we keep track of that as well. We keep iterating the model — essentially running a simulation to figure out where it converges in terms of the number of variables in the model — and then we keep that, right? It's admittedly a brute-force way of getting at this, but it works quite well. And I think this is a pretty good approach if you want a stats-based method and don't want to fall back on lasso; lasso doesn't perform quite as well as this model. A few final comments. This model is actually a bit more flexible than I let on at the top, right? So it can predict fraud, right?
It can predict other things too. For instance, you can restrict to frauds where the management admitted it, or frauds that will only become publicly known in the future. We can also look at just certain types of fraud — say, fraud that's purely financial, where they're just covering up the accounting numbers and there's nothing else going on in the background, right? Those end up actually being harder to find, comparatively. We can also try to predict when the U.S. government will investigate, right? Or we can even look at cases where the company is not committing fraud but is making an accidental mistake in its accounts. So you can use these types of models to predict quite a bit more than just fraud. Now, a few final notes on ways you could actually do better, right? One: use a better tokenizer. We did this back in 2014, right? For the actual text processing, use spaCy or something like that; use noun phrases and you'll get a much better result out of LDA. Second, you can use methods built for sparsity, e.g. XGBoost — XGBoost will give you probably about a 0.05 increase in AUC. You can also try other machine learning algorithms besides LDA. LDA is actually from 2003; it's been a while, and there are newer, better things out there — there's no harm in trying them. There are also approaches that aren't LDA-based that have come out since these models were built, right? NLP has actually moved a lot in the past 5 years. So, while this is, say, the most robust fraud prediction model in accounting today, better ones can be made. One final thought, though: this paper and all this work weren't motivated by building a better model in the first place. There's a reason we don't implement these things, right? Our goal was actually to show that something like this helps predict fraud at all.
If you guys are interested in that, you're welcome, of course, to build on our work and implement those things. But for now, we're going to keep it as it is. That's all I have for you today. Thanks. If you want to learn more, the paper is publicly available on SSRN. The slides are also publicly available at rns.net; you can grab a copy of the slides in PDF — the link is in the slides. Do you have any questions?

Regarding the graphs that show we catch 28% of fraud or something like that — there's actually a really high false positive rate, because there are very few positives. The thing is, when it comes to governments, the US government looks at 20% of all companies every year anyway. So for them it's not so much about precisely identifying fraud as about directing attention: among those 20% you look at, you should look at these 5%. In essentially any rare-event case, your false positive rate is always going to appear high.

Can you explain how you do the censoring of the data? That's quite interesting. We actually know the date each fraud became publicly known. And because we know that date, we can simply ask: is that date inside the window we're using for training? If it's in the window, we keep the label as one — we say that's a fraud. If it became public, say, January 6th of the following year, we say: well, if we were the government back in that year, we wouldn't have known. So we label it zero anyway. It's actually biasing our data, but in a way that makes it more useful.

The model runs on financial statements, so we don't have non-English data, unfortunately. There is Chinese data, though; at least in terms of LDA, it works pretty well for Chinese text, and there are also variants for Chinese text with their own implementations. The theory should still hold, right? It doesn't matter what country you're in.
The psychology theory here just says that when somebody is lying to somebody else, they're very intentional about what they talk about and what they don't talk about. That should hold across cultures, and it should hold across regions, so we'd actually expect this type of model to work pretty well no matter the language.

So do you use an objective measure, or do you actually look through the topics? One of the things recommended in one of Blei's publications is to look at how the algorithm performs in-sample and use that to tune the hyperparameters. So we take 1995-1999 and train a model on that for frauds — but only in-sample — and we essentially just run LDA models over a huge range of hyperparameters, right? Then we converge toward good hyperparameters by looping. So it's not really a proper grid search; it's more like cutting off parts of the grid as we go. The full details are included in the paper, if you want to learn more about that.

I have a few questions. First question: have you checked whether your models are robust to a change in accounting standards? For that, we don't explicitly control for it now. I mean, the rolling-window design actually takes care of most of that anyway. If you were using the full 1994-2012 period as one training set, right?
that would be a huge issue. But because each training window is only about five years, any one regime change touches at most a small slice of it. The main one is Sarbanes-Oxley, around 2004, and mostly what that coincides with is fraud dropping like a rock at that point.

Would it be a problem if management decides to game the measure itself, say by manipulating the disclosure? Yes, they could do that. The thing is, it's much harder to do that than it is to, say, keep your financials within a certain range. There's a reason the financial measures stopped working: it's actually quite easy to keep the numbers within bounds. The style variables typically still work about the same as they always have; they just never worked that well on their own. The thing is, for our model it's a moving target: every year we retrain the LDA. I didn't show it in the presentation, but I can pull it up — there's a graph in the paper where we actually show the topics changing over time. Our topics aren't static, so every time we move to a new training window, we're changing what the target is. As a manager, you'd have to guess what the target will be next year when you're trying to game it, and that's actually very hard. On the specific question: there are some lawyers who, unfortunately, are trying to help clients commit fraud, and they've asked us, how can we get around your measure? And we're like, well, we're not going to tell you. If you want to take a look at what we capture, that's in the paper, in the diagram.

So, currently, who is using these models? The US Securities and Exchange Commission — Dr. Mekong worked with the US SEC for a while, although the part he worked for is not the same part — the US SEC actually uses a pretty similar type of model for its own detection of fraud nowadays.

I seem to run into a maze here, because as engineers we always talk about cause and effect; in the same school we seldom talk about correlations without a cause-and-effect theory. But you know the trends: there are certain changes in the environment — for example, Donald Trump becoming president, the whole thing becoming a mess — those are bigger factors that change corporate behavior, or corporate profit and losses, and they might mask the trends over 10 or 20 years. So what I'm asking is: when you do AI, for example here, do you not try to identify cause-and-effect relations, or are you just making correlations of keywords, correlations of trends?

So, our cause and effect isn't on the economics side; our cause and effect is actually on the management side. That's why I said the theory for this paper comes from psychology, not from economics. What we're saying is that the cause is this: when a manager is creating fraud, there is a change in the mindset of how they approach writing. That's what we're actually capturing, and then we use it in reverse — we're sort of using reverse causality. We say: because a manager is creating fraud, they're going to change the way they write, they're going to change what they talk about in the document. And then we just invert that and say, well, if there is a correlation in that direction, it points to the fraud. So it's not as direct in this type of study, because we can't directly observe the manager's actions; the only thing we can really observe is the change in the writing.

I totally agree with you that it's difficult to model or quantify the manager's behavior. But the manager is working within a system, and the system is a macro system. So, the thing is, when it comes to fraud, we say there are usually three components that actually drive fraud, and the system is one of those three components. A big factor is actually going to
be within the company, in terms of corporate culture. We don't have that — we can't model what a company's culture was like in 1990, at least not at scale. It's hard to quantify. If we had it, I'd love to put it in the model. Warren Buffett has it: he'll go and talk to management. So maybe you should do that — but it's different at scale. We'd have to do this 38,000 times, and we'd have to invent a time machine to do it back in the 90s. We can do it for a couple of companies, but even for the usefulness of the model, if you have to interact with every company — even for a government, that may be too much. It's too much for them to interact with all the companies in Singapore. We can flag them, though, and then you can go talk to them. Then they can say: now let's look at the information systems, let's look at the controls, let's look at how the people behave. We're just providing a first-pass flag.

So the question was: is there some meaning to the model we have, and can we train a human to do it? The answer is yes and no. Yes, it's interpretable — in our paper we actually show you all the topics, and they're reasonably coherent, about as reasonable as we'd expect. So you could tell somebody: in this year, companies talking about joint ventures looks normal — but that changes over time; in a bad year, joint ventures looks weird. The model is not constant over time, so you have to keep that in mind. The other issue is scale: across all the data we have, there are over 3 billion words to read in these documents, and that's really the problem — it's just too much for a person to handle. That's why machine learning is so great for this type of problem: you can tell the machine to go read 200 years' worth of material, and it will get through it in no time. So it's yes and no.

Okay, I think I'm out of time for questions. If you have any other questions, feel free to approach me after the talk, or to chat about interpretable machine learning. If you have any more
questions, of course you can ask me after the talk. Also, when you ask a question, please use the microphone and introduce yourself, so that the audience can hear the question as well.