Okay. This is a quick talk about NLP lessons learned, or what happens when you jump into an NLP project without really understanding what's going on. So it's a "what went wrong" kind of talk, but it does have a good result. My name is Martin. I started in machine intelligence with a PhD, then went to the dark side in finance. I lived in New York for a long time and moved to Singapore in 2013. In 2014 I decided that instead of the finance thing I would do what I actually want to do, which is the AI thing. That has since been distracted by the whole machine learning, deep learning, and natural language processing thing, which is at least more on the path than finance. My real intent is AI, and we're not there yet by a long way. Since 2015 I've been in serious mode, doing this NLP and deep learning work for an actual client, and for my sins I've been publishing a few papers along the way, because I wanted to.

So the project: this is what we started with. The client takes text about companies and wants to output a bunch of entities and relationship tuples. Their product, which they already sell to customers, is a complete knowledge graph for, in this case, the Singapore financial markets. It has all the people, all the companies, and their relationships: who's a director, who studied where, subsidiaries, listings, auditors, a whole bunch of stuff. At the moment they go through the text and type it in manually, so this is an opportunity to get some productivity gains.

Here's an example of the text they look at. It looked pretty easy when we started out. So this is Dr. Willie: this guy is obviously a person, where he studied is simple to pick out, right? Dr. Lee was appointed a non-executive independent director, which sounds like a relationship with a company, and he's a member of something. So it looks like there are some tuples in here; it should be easy enough. Theoretically, what we'll do is get some sort of Stanford NLP suite in, do some relationship extraction, put a GUI on it so a human can fix it up, and it's going to be great. The criterion, really, is that we want this to be better than humans, in the sense that they can benchmark what the companies they outsource to are doing, and roughly a 70% recognition rate, a repeatability rate, is what they're going for.

Now, when it comes down to it, the actual real project is slightly different. Having started, we discover that they want us to do all of this from the PDFs the companies produce, so we're not getting clean text. For the output they want entities and relationships, but the target score is slightly higher, which doesn't sound too bad, 80 versus 70, and they only care about relationships, because that's all they actually store. The entities are just a by-product of the relationships between them; they only care about relationships, so that's all you're going to get scored on. And they want this to operate in a turnkey manner: you press a button, it parses the PDF, and you get a CSV of relationships coming out.

So what changed? Scoring by relationships only, which is a major difference. The text being in PDFs is a problem in itself, and the standalone requirement is probably not so easy either. So let's go into why this is difficult, starting with scoring.
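To make the scoring unit concrete before diving in: each scored item is a full relationship tuple, and the turnkey output is just a CSV of them. Here is a minimal sketch of what that output might look like; the field names and example values are my own illustration, not the client's actual schema.

```python
import csv
from typing import NamedTuple

class Relationship(NamedTuple):
    """One scored unit: both entities, disambiguated, plus the relation type."""
    subject: str    # canonical entity, e.g. a person
    relation: str   # one of the ~100 relation types
    obj: str        # canonical entity, e.g. a company or committee

# Hypothetical output for the Dr. Lee biography discussed below.
rows = [
    Relationship("Lee, Willie", "independent_director_of", "ExampleCorp Ltd"),
    Relationship("Lee, Willie", "studied_at", "National University of Singapore"),
]

with open("relationships.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(Relationship._fields)
    writer.writerows(rows)
```

Only a tuple that is correct in every field counts; anything partially right is scored as wrong, which is what makes the arithmetic below so unforgiving.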
So scoring a relationship means you have to find the entities, then disambiguate the entities, and they have about 100 different relationship types, and only if you get every piece of the tuple correct do you score one. Any mistake you make actually counts against you, because with an F1 score, if I guess something that isn't in the set they're looking for, I'm penalized for it.

Let's see roughly where we'd be in this text, without the nice highlighting. I can probably pick up the name pretty easily. MBBS, I don't really know what that is; maybe it's an abbreviation for a company name. University of Singapore I could probably get, except it has changed its name, it's now NUS, the National University of Singapore, so suddenly we're into a disambiguation issue, but the studied-at relationship is probably doable. Dr. Lee, well, there's an issue here, because Dr. Lee is the same person as the previous mention; the "Willie" is just the common practice of taking a Western name alongside the Asian one, so there's a coreference issue. The non-executive independent director is actually two different relationships: non-executive director and independent director. The company name is fair enough. The member relationship here is actually with two different entities, the audit committee and the remuneration committee, which are themselves different from the company itself. So there's a lot more detail in what they're actually trying to extract than appears on the surface of the naive NLP problem.

So, back of the envelope, how are we going to do with our off-the-shelf stuff? Finding an entity: suppose my NER, the named entity recognition, is 90% accurate, which is fairly good going. You've got to get two of these entities, so that's 0.9 times 0.9. You've then got to disambiguate them; say I have a 90% success rate disambiguating a named entity, which is pretty good going, but that's another 0.9 squared factor. And then we've got 100 different relationship types. If I can recognize a single relationship with 90% confidence, I'm now competing with 100 different relationships all wanting to be chosen, so that 0.9 is extremely optimistic, because the more relations you have, the more chance that something will, by mistake, produce a spurious match. So the basic problem we face is that 0.9 to the fifth power is roughly 0.59, which is much less than 0.8. If we just start off doing this as best we can, even making good assumptions about every step of the process, we're going to fail this task horribly.

So that's the problem. Let's go through some of the other difficulties. Apart from the scoring: PDFs. PDF is a horrible format. It's basically a bunch of words, or even letters, with x-y coordinates on pages. There's no structure, or at least you can't rely on any structure being in there. But when you come to these annual reports you have, for instance, a page where it is clear from 10 or 15 feet away that this is two columns of nice little biography sections. But if you actually look at how the page is laid out, there aren't really clean columns at all: different pieces overlap in different ways, it divides up really oddly, and the people's names are stored separately from their paragraphs.
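To give a feel for what "a bunch of words with x-y coordinates" means in practice, here is a minimal sketch using the pdfminer.six library (one plausible choice of PDF library, not necessarily what the project used) that keeps the bounding boxes instead of flattening everything into one text stream:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def text_blocks_with_coords(pdf_path):
    """Yield (page_number, bounding_box, text) so downstream code can reason
    about columns, tables, and which name sits next to which paragraph."""
    for page_no, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in page coordinates.
                yield page_no, element.bbox, element.get_text()

# Illustration only: a crude left/right column split at an arbitrary x = 300.
for page_no, (x0, y0, x1, y1), text in text_blocks_with_coords("annual_report.pdf"):
    column = "left" if x0 < 300 else "right"
    print(page_no, column, text.strip()[:60])
```

Keeping the coordinates is what makes it possible to even ask the column and table questions that come next; asking a library for plain text throws that away up front.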
So there's all sorts of data in how the page is laid out. Here's another example: there are tables on this page with no lines or anything. "The board of directors consists of" is a line right here, and then there's a list of names next to a list of their actual positions, and that's a table which you can't see. Apart from that, there isn't any text expressing those relationships; the actual content of the relationships is in the formatting of the page. Same with the hierarchy: a lot of these documents have beautiful numbering, and as a human you can see what's going on, but to understand a legalistic document like this you have to work out how that structure fits together.

How important is this? Well, one issue is that every existing PDF library we've looked at throws a lot of this information away; if you just ask for text, it removes most of what you want. And layout carries about 50% of the relationships in these documents, partly because you will often get, say, a shareholders table, which is 20 points right there for the taking if you can understand it. But if you miss it, you're completely toast on the test.

Another problem with a standalone system is that if it goes off the rails in its early decisions, you've got no user to put it back on track. One of the key things in these company documents is the company itself: if you get that wrong, every relationship you produce will be wrong. Initially this was meant to be a helpful user interface, a human-assist thing; the scoring turned it into a standalone product whose output is consumed as CSV, so you never get the chance to revisit anything with a helping hand.

There's a whole bunch of other issues, and I'll just run through these. Academically, good NER on some of these test suites is around 90% and is considered more or less solved, and many things you download off the web say, here, use this library and it will give you all the named entities. But 90%, as we've seen, is horribly bad when you actually have to do this in practice. Another problem is that some of these libraries are written for research purposes, not for speed, and it's a very different consideration when you've got all night to run the thing for a paper versus when I want to press a button and see it happen. Licensing is another issue. One of the other milestones embedded in this whole thing is to sell the system to an external customer, and the problem with many of these libraries, including the Stanford one I mentioned, is that you've got a whole GPL issue going on. So there's a lot of work making sure you can actually use these libraries, and corpora may have non-academic, non-commercial use clauses on them too. Another thing peculiar to this project, since I'm in Singapore: lots more Asian names. Most of the corpora around are US-centric, and there are lots of Asian naming quirks, particularly for Singaporeans, because we're surrounded by places with very, very different conventions for names.

So what you naturally come up with is: let's have some specialization there as well. You may have a domain-specific NER, for instance for something often done in corporate documents, where the text tells you, okay, this is an acronym which I've just defined; you want an acronym-defining NER to pick those out. So in that case we're going to have multiple sources of NER.
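As a rough illustration of that acronym-defining NER idea, here is a sketch of a high-precision pattern for the way annual reports define a short name in parentheses right after the full company name. The pattern, suffix list, and example sentence are my own, not the production rules.

```python
import re

# Matches definitions like: Example Holdings Limited ("EHL")
DEFINITION = re.compile(
    r'(?P<full>(?:[A-Z][\w&.-]*\s+){1,6}(?:Limited|Ltd\.?|Berhad|Pte\.?\s+Ltd\.?))'
    r'\s*\((?:the\s+)?[“"](?P<short>[^”"]+)[”"]\)'
)

def acronym_entities(text):
    """Return (full_name, defined_short_name) pairs declared in the text."""
    return [(m.group("full").strip(), m.group("short"))
            for m in DEFINITION.finditer(text)]

print(acronym_entities(
    'Example Holdings Limited ("EHL") announced that Dr. Lee joined the board.'
))
# [('Example Holdings Limited', 'EHL')]
```

A rule like this rarely fires incorrectly, which is exactly why it is worth having alongside the broader, noisier statistical taggers.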
So in that case, why not have lots of different NERs, because they might do different things: they might be wrong in different ways, but they might also be right in different ways. This also led us to build the thing as a whole collection of services behind a REST API. It allows us to experiment more: we can build a prototype, wrap it in a small API, throw it into the system, and the system will handle it. So that's a structural point.

Then we're talking about users. One of the things is that when you start to increase the number of NERs, which sounds like a good idea, even if they're all 90% good in slightly different ways and don't overlap totally, the noise increases as well, because each one's 10% of errors may be different. When you're a user it's very easy to scan through a list of names and spot that the system has decided AGM is the name of a person, when AGM is the annual general meeting and the whole thing has got confused. But when you say, here's a list of names, and there's this junk NER output at the bottom, users are startled and they think the system is horrible.

So to do this in a practical way you're going to need some magic, because we know the naive approach isn't going to get there. The magic here is the company's existing data. They've been collecting this data for a couple of years now, and they have very extensive data on who does what to whom in Singapore, Hong Kong, and Malaysia. The nice thing about having a fairly complete graph is that you also have a very good NER source, so you can start to auto-label existing documents. One of the requirements for building a new NER is having a corpus that is fully labeled. One thing you can do with external NERs you find off the shelf, which may be badly licensed, is label existing text of any description. And if you then train a deep-learned NER on those labels, you can actually learn what these things are doing internally; you can learn to behave like they behave. One of the nice things about this approach is that you end up with an NER that is not licence-encumbered, because it has learned from its teachers. And this is one of the papers I've published: you can actually learn an NER that is better than its teachers, which is an interesting result, because with multiple teachers you essentially ensemble their knowledge. It also means that by building your own NER you can incorporate those Asian naming quirks.

Another thing is that once this graph has critical mass, enough entities with enough relationships, you can start to assume that the fact you're looking for is probably already in the graph, rather than being something brand new to discover. And because it has achieved that mass and coverage, you can start to use the database against the data rather than just throwing the data against the database. Initially, the ground truth data can confirm a fact. You might say, well, I think this guy studied in Singapore, and the ground truth data can say, yes, he studied at a Singapore university, like you say. But you can also ask, which Singapore university did he study at? Is it NUS? We've got maybe five different universities in Singapore with slightly different names. So the ground truth can start to suggest back to you how to do better.
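Here is a minimal sketch of that "bounce the guess against the graph" step, assuming a made-up list of canonical entity names standing in for the client's graph: a vague mention from the document is matched against known entities, so the system confirms or corrects its guess rather than minting a brand-new entity. The matcher and threshold are illustrative only.

```python
from difflib import SequenceMatcher

# Stand-in for the client's knowledge graph: canonical entity names.
KNOWN_ENTITIES = [
    "National University of Singapore",
    "Singapore Management University",
    "Singapore University of Technology and Design",
]

def resolve(candidate, known=KNOWN_ENTITIES, threshold=0.6):
    """Return (best_match, score) if the candidate plausibly refers to a known
    entity, else (None, score); the graph confirms the guess for us."""
    scored = [(SequenceMatcher(None, candidate.lower(), k.lower()).ratio(), k)
              for k in known]
    score, best = max(scored)
    return (best, score) if score >= threshold else (None, score)

print(resolve("University of Singapore"))
# ('National University of Singapore', ~0.84) -- confirmed against the graph
```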
But you can also start to use it to proactively suggest things you might find in the document, which is very, very helpful. And this is where it starts to go: because you can now attenuate noise by checking against known data, you can start to guess entities more freely, bounce them against the database, bounce them against the document, bounce them against the web. You can start to build up clouds of information about these things, because you've got this ground truth data, and that means you can be more aggressive in finding and guessing at the different things which might exist. The net result is that scores go up, the whole system becomes much more robust because you have multiple fallbacks in case one of these mechanisms doesn't work, you get more data, and you get happier.

So, wrapping up: building these NLP systems is not as simple as it says on the box. As the system grows it takes on a life of its own, as lots of systems do. But when it does start to achieve lift-off, it's very satisfying, because suddenly the thing will self-confirm all these hypotheses it's making. So that's a very interesting thing. Questions?

Audience: Are there any legal repercussions to building an NER from another NER, like learning from a licensed one, when you're selling it to other people?

Martin: That's an interesting legal question. Essentially, we're using existing GPL code, and with the GPL, if I distribute any code I would have to distribute the whole thing; therefore I cannot distribute that code. But is the output of running this NER, which is now just a CSV file of sentences and entities, owned by the NER algorithm? I would suggest no. And the fact that you can do this on a billion-word corpus fairly easily is something of a giveaway as far as extracting the knowledge inside an NER goes. But that's not my problem.

Moderator: Just a minute, please. Martin is also conducting a workshop and will be available in the speaker lounge today from 2:20 to 3, so please feel free to talk to him about the workshop there. He's also doing a BoF session along with Shailesh and Sumo than Prasanna, on the topic of going from machine learning to deep learning, so you can go there and talk to them too.

Audience: Hey Martin, thanks, very interesting talk. I wanted to ask about the specialized NER you spoke about. If you start off with labels, how do you create that many labels and a training set like that? Because if you think about Stanford CoreNLP or other libraries working on WordNet and so on, they work on something like the Wikipedia corpus, which is pretty huge. But if you're creating a specialized NER with custom entities, how would you do it?

Martin: There are a couple of tricks. One is that the Wikipedia corpus is only about six gigabytes of data, so from a language point of view it's not a huge corpus; there's a web crawl which is 42 gigabytes, and that's pretty big. Wikipedia we can train on in a night, so that's about the unit of training time we work with now.
There are also several different specialized NERs that we have, things like legal definitions, and there's a very simple one we call a signpost: "Mr" something-something. "Mr" is a huge signpost that a named entity is about to occur, and it very rarely gets misled, whereas some of these other NERs are more stochastic in nature and will pick up lots of other junk as well. So we have some fairly specialized things to get us to harder truths than the more stochastic methods give. Stanford has been training theirs for a long time, and their NER, in fact their whole suite, has accreted a lot of code from graduate students passing through, so it can be helpful to start with a blank sheet of paper as well. Sorry, can you repeat the question with actual volume?

Audience: You mentioned that you used an ensemble of NERs, right? What kind of ensembling was that?

Martin: There are two levels at which we ensemble. One is to create the new, deep-learned NER: we ensemble parallel texts labeled by lots of different external NERs which we can't use directly, see who agrees on which sentences, and then essentially use the good sentences. This is all detailed in the paper. In terms of ensembling the radically different methods of doing NER, there's another little deep learning piece where you can select who would be good at this kind of thing at this point. And this is one of the other issues with the academic work: people try to produce the answer with the single best method, and they'll have a hobby horse which is their way of doing it, whereas in reality it's good to have a whole bunch of different things, different sources of truth, to ensemble together.

In many ways I think it's an interesting project, partly because you can make it address all these different things; we didn't want to produce the one monolithic good answer. In a way, one of our models there was IBM Watson, where, on a monolithic basis, they could see their accuracy increasing asymptotically, bit by bit, winning some and losing some each week. Whereas by adding new components you can essentially step-wise either add them or throw them away, and the fact that you can switch and change these things was a huge win. So I don't particularly want to say exactly how we're doing this, but having a lot of very different components and a big collection of facts was a win, which we implemented without huge numbers of engineers.
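To illustrate the simplest form of combining disagreeing taggers (the real system selects which source to trust in which context, and the teacher-label distillation is described in the paper), here is a toy per-token majority vote; the source names and tags are invented for the example.

```python
from collections import Counter

def majority_vote(taggings, min_agreement=2):
    """taggings: list of per-token tag sequences, one per NER source.
    Returns one tag per token, falling back to 'O' (no entity) when the
    sources cannot agree -- disagreement is treated as noise."""
    combined = []
    for token_tags in zip(*taggings):
        tag, votes = Counter(token_tags).most_common(1)[0]
        combined.append(tag if votes >= min_agreement else "O")
    return combined

# Three hypothetical NER sources tagging "Dr Lee joined AGM" token by token.
stanford_like = ["O", "PERSON", "O", "ORG"]   # mistakes AGM for an organisation
signpost_ner  = ["O", "PERSON", "O", "O"]     # the "Dr"/"Mr" signpost rule
deep_ner      = ["O", "PERSON", "O", "O"]
print(majority_vote([stanford_like, signpost_ner, deep_ner]))
# ['O', 'PERSON', 'O', 'O']
```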