[inaudible]

So it is a huge task, and basically, as you can see along the bottom of the slide, they build a graph. Their customers are law firms, hedge funds, or the regulators themselves, who want to dig into companies for a particular problem. But this is a manual process and that is not optimal, particularly since they now want to go to Hong Kong and do other markets.

My job, or the project, is about increasing productivity so that it's not all done by hand. It's a big natural language processing thing. Basically we'll have a machine reading all of these PDFs, extracting the entities and relationships, and then a kind of second-level analyst saying whether it's on the right track, and then the machine can go back and revise the estimates for us. So we've got both 9 and 18 month milestones. The head count is low, it's me, but there should be more. So that's another excellent thing.

So we're also building this Watson-style, with many experts, rather than as one big machine learning system, where fixing one piece tends to make it worse at something else.
Academically you want to build one know-it-all system, but what the IBM people discovered was that if you have a bunch of little experts on different subjects, you can train them independently and achieve an overall higher score. So that's how we're doing it. And this is going to have an internal customer, which is the company, but also, for the 18 month thing, an external customer. There are lots of people who want to read unstructured documents and extract good data.

So in building this system, I wanted it to be developer friendly, because otherwise I'd be shooting myself in the head. I also wanted it to be language agnostic, because Python is where a lot of the good libraries are. In particular there's a thing called Theano, with which I can start to use GPUs, which is very good for machine learning stuff. On the other hand, it needs to be kind of interactive. There's going to be lots of communication, and the Node stuff is much better at async stuff than Python will ever be. I also wanted it to be maintainable, so I didn't want to go into the esoteric stuff which I've been playing with. So probably no Go, no Scala, no Haskell here. And in fact they have tech people in Thailand, and when I reassured them there's going to be JavaScript and maybe some Python, they were very happy about that. The other thing is there's no need for web scale. I mean, we're not doing anything fancy with databases here, or scaling out. It's very much making a database which is probably a few gigabytes in size. It's not terabytes; it's not a Hadoop kind of thing.

So, day one. I thought I was clear from the outset that I'm not a .NET person, but anyway. They are a Windows place, and they consider Windows to be enterprise, so there was a bit of positioning on day one, particularly since they hadn't actually seen any kind of output from me at the time. So the compromise was to have a big Windows server, which they considered heavy duty.
And they can run their MS SQL on that, they can run a Git server on that, perfectly well. They're also running a very nice Microsoft Hyper-V hypervisor, which basically is the entire machine, apart from this other little piece. So anyway, it was presented one way, but the reality is we've got a very large Linux playground for me to play in, and everyone's happy.

So, the component boxes, the different bits. When you're building a decent-sized system, I wanted to deal with the configuration, because I wanted that to be consistent across all the different systems. Down there we get to microservices, and this is why it turned out this way. There'll be a nice web front end, which will be used by a bunch of analysts. They've got 20 people in Singapore, and about 20 people outside of Singapore, doing, I would say, backbreaking, mundane kind of work: highlighting stuff in Adobe Reader, dragging it into a little database front end, and pressing the enter key. It's terrible. So I can improve that, which is good. There's a database. I don't want to interact with that database much, because it's Microsoft. So this is why one of the microservices is there essentially to abstract that away into just a data source. The microservices are there because everything has to communicate with everything else, and I also want to be able to let other people build a service and just integrate with it, so we're not all getting in each other's way. Documentation, I hope, is a priority for me, because I want to be able to hand this over to someone. And then there's a bunch of other things as well.

Configuration. So I want to be language agnostic: everything has to be able to read this configuration format. Why don't I use JSON, which would be the obvious one? Because you have to keep quoting stuff. You need lots of brackets. You can't do comments. And then there are these stupid comma things: I'd quite like to just add things to a list or delete them, and the comma at the end, there it is.
YAML is a more Python-esque kind of thing, with a whitespace-dependent layout. On the other hand there's nice tooling, and in particular there's a Node YAML config module, which means you can do a configuration like this. What's going on here is that this is just one file, and the real file is actually longer because each service has its own piece. This is saying that under the default section there's a service called Ground Truth, and the Ground Truth server has this host and port and various database strings or whatever. But then if you're in a production environment, the production section down below will overwrite whatever parameters it wants to, so that the production port will actually turn into 372. There'll be various other things, so if I have different environments, there'll be a development one where I don't want to keep loading up big datasets, or a testing one which I want to be able to run concurrently. We can fix all that pretty easily. Of course there's a nice Python module for reading this in, so you can read it wherever, and it'll turn out beautifully the same everywhere.

So, the web server. I'm not inventing any new stuff here. This is Bootstrap: we've picked a nice theme, done. Apologies to all the designers, but it works really well. Then Node, Express and Jade, which is kind of nice, plus Socket.IO and PDF.js. Jade is something essentially coming out of a Ruby-style Haml thing. It's really convenient. If you're as irritated as I am by commas and closing brackets, then you'll kind of like the idea of indentation-implied structure, and there are tools to just turn HTML into nice documents like this. It makes editing them almost a pleasure.
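The default-plus-override configuration described above can be sketched in a few lines. This is only an illustration, not the module used in the talk: the parsed YAML is represented by a plain object, and the hosts, ports, database names and the `configFor` and `deepMerge` helpers are all hypothetical.

```javascript
// Hypothetical stand-in for a parsed YAML config file: a "default" section
// plus per-environment sections that override only what differs.
// The service name follows the talk; hosts, ports and database names are invented.
const parsedConfig = {
  default: {
    groundTruth: { host: "localhost", port: 8001, database: "entities_dev" },
  },
  production: {
    groundTruth: { port: 9001, database: "entities" },
  },
  testing: {
    groundTruth: { port: 8101 },
  },
};

function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Recursively overlay `override` on top of `base` without mutating either.
function deepMerge(base, override) {
  const merged = { ...base };
  for (const [key, value] of Object.entries(override || {})) {
    merged[key] =
      isPlainObject(value) && isPlainObject(base[key])
        ? deepMerge(base[key], value)
        : value;
  }
  return merged;
}

// Resolve the effective settings for one environment.
// An environment with no section of its own simply gets the defaults.
function configFor(config, environment) {
  return deepMerge(config.default, config[environment]);
}

const env = process.env.NODE_ENV || "development";
console.log(env, configFor(parsedConfig, env).groundTruth);
```

Every service, whatever its language, would apply the same rule: start from the defaults and let the section for the current environment win key by key.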
Socket.IO is another choice. I knew that I needed some feedback and interactivity with the web pages. In particular, as the machine decides on results, what I'd like is a stream of hypotheses that the user can interact with. Doing that with jQuery turns into a kind of hell, because with jQuery, I mean, at the beginning it seems like I'll do a query and I'll get results, but then suddenly you start to have different branching possibilities and it turns into something really unpleasant. With Socket.IO you basically get a message bus. This shows that you can construct a socket on the server side and then just connect to it. Also, like Express, you can add in middleware for authentication. You can then receive these queries and send results back along the socket. Clients can connect and reconnect. It's all very flexible and beautiful.

PDF.js: this is a thing out of the Mozilla guys, and it's how PDFs are displayed in Firefox, but you can also embed it in a web page and then manipulate it using controls. You have to fiddle with some of the source or whatever. I did this fairly early on, and it kind of demonstrated that the whole PDF process they were doing could be replaced by something that was all embedded in a web browser, and if you clicked on one side it would highlight in the document. They were blown away by this idea, so that was a win. But this will also be important later, in that it's nice to know that the PDFs are completely understood by this.

Database. So MSSQL, I prefer not to deal with this. So this is a bridge into the other side of the big server. One other thing is that, because of that, I've got a testing regime which is being imposed on me for machine learning purposes, in that I have a before state of the database, I get a new document, and I then find out what the actual analyst would have said. So I have two snapshots, but I also want to control who gets...
I don't want my components being able to query the after states, and you need to keep very strict control of them. I wanted to produce a service which only allowed one view of it, and then I can fire off the after states to just a verification service. I actually did all this. This stuff is all in JavaScript because I want it all to be nice and concurrent. There could be long queries. It's very performant and it also does some interesting scoring. So basically I take their database and I cache it, and I also break it down into an information-theoretic score. For instance, here I'm searching for Louis Cockloon, who is some person, and this is just a REST query. It gives back a series of names: Louis Cockloon happens to score 20 points, and the next would be Louis Cockloon, who scores less. I can actually measure the information content of each name based upon the rarity of each word. By caching the whole thing I can do these scores very fast, and it turns into something which I know that they can't do. This is another win: they can't do this with any ease on the Microsoft side.

Microservices. So that's how I started, but these are all going to be REST APIs. My initial idea was that I wanted some ZeroMQ or nanomsg, because that is high performance for the internal stuff, but the request-response thing isn't quite there in terms of concurrency for Node. It will work if you put lots of different processes on the same socket, but it won't do the nice thing that Express will do. So as a fallback I chose a thing called hapi, which is for REST APIs. It's kind of a simplified Express. None of these are going to be exposed to the outside world; this is an internal thing.
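The rarity-based name scoring mentioned above could look something like the following. The talk does not give the formula, so this is a guess at the idea: each word is worth -log2 of the fraction of cached names that contain it, and a name scores the sum over the query words it matches. The name list and function names are invented for the example.

```javascript
// Hypothetical cached name list standing in for the database snapshot.
const names = ["John Smith", "John Tan", "Mary Tan", "Zebedee Smith"];

const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Count how many names each word occurs in, then convert to bits:
// a word found in few names carries more information than a common one.
function buildWordInformation(nameList) {
  const counts = new Map();
  for (const name of nameList) {
    for (const word of new Set(tokenize(name))) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  const information = new Map();
  for (const [word, count] of counts) {
    information.set(word, -Math.log2(count / nameList.length));
  }
  return information;
}

// Score each name by the summed information of the query words it contains,
// drop names that match nothing, and return the best matches first.
function scoreNames(query, nameList, information) {
  const queryWords = new Set(tokenize(query));
  return nameList
    .map((name) => {
      const score = tokenize(name)
        .filter((word) => queryWords.has(word))
        .reduce((sum, word) => sum + information.get(word), 0);
      return { name, score };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score);
}

// The word table is built once and kept in memory, which is what makes
// repeated queries cheap.
const information = buildWordInformation(names);
console.log(scoreNames("Zebedee Smith", names, information));
```

A rare first name therefore outweighs a common surname, so the exact match ranks above a name that shares only the common word.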
I've got stuff for launching these in systemd, which happens to be on the box, and then I've also got a thing whereby the different services can be placed on different servers. I want to be able to cope with that without hurting myself, because I have a development machine and I want the services on the production machine to keep running; I don't want to have to bring up an entire production machine on my laptop, because that would get crazy.

So, hapi. Hapi is quite happy to let you make a quick server: you create a route, and then you just add a handler and hand over something like the response. What I wanted to do, though, in the main app.js kind of file, was to put all of the route handlers without any of the request stuff, because I wanted to divide out the requests and hand them over cleanly to the back end. The response thing is fine, but I didn't want the response side digging through the request in any detail. In particular, because I want to document this file extensively, it makes sense to do all of the handling of the request in this file before I launch out into the route.
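One way to picture the split described above, with request parsing kept in the main file and back-end handlers that never see the request object, is the sketch below. It deliberately does not import hapi: the route objects only mimic the method, path and handler shape of a route definition, and every name and value in it is hypothetical.

```javascript
// Back-end handlers: plain functions of plain arguments, with no knowledge
// of the HTTP request object. (Names and data are hypothetical.)
const handlers = {
  searchNames: ({ query, limit }) => ({
    query,
    results: ["Example Name A", "Example Name B"].slice(0, limit),
  }),
};

// Request parsing lives in one place: each extractor pulls out and validates
// exactly the fields its handler needs. `request.query` and `request.params`
// are assumed to look like a typical framework's request object.
const extractors = {
  searchNames: (request) => {
    const query = String(request.query.q || "").trim();
    if (!query) throw new Error("missing query parameter q");
    const limit = Number.parseInt(request.query.limit, 10) || 10;
    return { query, limit };
  },
};

// Build a route description in a { method, path, handler } shape.
// Wiring these into a real server is omitted here.
function makeRoute(method, path, name) {
  return {
    method,
    path,
    handler: (request) => handlers[name](extractors[name](request)),
  };
}

const routes = [makeRoute("GET", "/names", "searchNames")];

// Simulated request, as a framework might pass it to the handler.
console.log(routes[0].handler({ params: {}, query: { q: "example", limit: "1" } }));
```

With this layout the file that lists the routes is also the only place that documents what each endpoint expects from the request, which matches the stated goal of documenting that file extensively.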
Systemd. So this is a Linuxism. It was painful, but it's actually pretty nice for launching this thing so it all starts up. It's also beautiful because it will restart these services. I was kind of dreading it, but it actually handles things pretty well. One minor quirk: I actually embellish my command line so that when I look at my processes I can see exactly what's running and what's not, because otherwise it's just node, node, node.

Poor man's mesh. So, using a hosts file, because of the way it's parsed, you can just say I want this one to resolve onto my laptop, and I'll have all the others resolve onto my production machine, just for development purposes. I can run the heavy-duty stuff, and as long as it's consistent then it's all going to be fine. So this is the poor man's way of doing it.

Documentation. So I wanted to put the documentation in the source, because otherwise I would never write it. I like Python's Sphinx, which is well tested and also kind of easy to use. I find the Javadoc kind of thing just hard to use. I don't understand it; whatever it is, it doesn't flow naturally for me. I've added some specific things so I can document APIs well in JavaScript and Python. Once you do that, the documentation is very simple: basically you just pretty up your code with this kind of syntax, so it understands different kinds of source and response kind of things. And at the end of that, you must have seen these Python Sphinx generated pages. It all weaves together nicely and looks kind of beautiful. But also, if you press the other button, you get a LaTeX PDF coming out, which is for the management: this is a great thick document to hold, and that satisfies that side.

So, other things. I wanted to have a kind of message bus so that everything can understand how everything has been processed, and nanomsg works pretty well for that. It has solid pub/sub, kind of the
internal version of Socket.IO. I wanted to do caching, because every little service can cache, and there are various other async things, so you want promises. For promises, Bluebird is probably a good thing. Promises are coming for real. And testing, which, incredibly, the company's developers don't have a regime for, so they're amazed that I can do this.

So this is nanomsg. It is a ZeroMQ-like messaging protocol. It's very close to the silicon, as opposed to being a big elaborate thing. Basically it allows you to create a socket which you can publish on, and this thing is just a little publisher sending out messages. Then on the bottom, this is a subscriber listening. This thing will allow you to have 10,000 subscribers, no problem. I'm probably going to have 3 subscribers, but this is the right thing to do, so I'll do that.

Caching. So this is promises. Basically, with callbacks, you can imagine that the first thing you're doing is, suppose I want to read a file and the file is not there, then your callback will say, okay, go and read it, and now I need to cope with it. Promises are very nice because I can just fail, resume, and then tie back in with my promise chain. It turns into a very nice solution. Promises, everybody.

Testing. Basically Mocha with, I think it's a browser, the Zombie browser. It's working very well for me. I'm not doing unit testing, since that would be too much, but what I do care about is end-to-end testing on some of the use cases, or of the external APIs: that the database didn't break, for instance, or that the user can still log in. So this is very good just for making sure I haven't completely destroyed it.

I think that's probably about all of the other things. So there are a lot of choices out there, and just spending the time to understand the trade-offs is very well worthwhile, I think. But now there's also the real work of doing the actual machine learning. So while doing all of this I'll be plugging away at making stuff which works sufficiently to keep the
company entertained with positive action. But you also have to make sure the infrastructure is all right, because later on they'll say, oh, good job we did that, and now we're on to the real work. So, questions?

Oh yes, I should say we are looking for a keen Singaporean. So if there are Singaporeans in the audience, we have a particular need for someone who's keen. And, really differently from many places hiring here, we don't care about the university thing; I care about the passion thing. So if you're programming because you're being paid for it, that's great, but we'd prefer that you're programming because you love it. That's a different thing. But the Singaporean thing is a thing. Apparently I'm allowed to say it.

So with your testing setup, how rigorous is it? Do you have tests for every feature that you implement, or is it more of a sanity check?

Somewhere in between. Once I think a process is well established and works for me, it's like, okay, let's implement a test just to make sure I don't break it. For instance, logging in and logging out is absolutely essential. You don't want to do something with the cookies which breaks your logging in and logging out, or your payment system. You've got to make sure. So some of the tests are fairly short, like can I log in, or if I go to a page, is the login page there. But others are like, let's follow a whole user story: can the user log in, look at a document, find a word, back out? The test will very quickly tell you whether it's doing the right thing. And 95% of the time the tests aren't breaking, so you don't want to run lots and lots of minuscule tests. If this one big test fails, then I know I've got a problem. But it's kind of a choice. It works for me.

One other question: you've got hapi and Express, and usually people go for one or the other. I mean, was that just that you evolved to hapi, and realized you had a choice, or chose it for the API, or what? So what was the...
Express is running on the web server. It can cope with lots and lots of things; it has to cope with lots of different issues. Hapi is almost zero: that one page is a complete hapi server, which I kind of liked. So just the efficiency of not having lots of choices to make was also good.

Can you please explain how you do the caching with promises?

Let me just... okay. The problem with all these code snippets is that they're all so little. Basically what Bluebird does is it can take an existing library and wrap it all. So this will take the fs library and add an Async method for every relevant method it finds. Instead of a call to readFile with a callback at the end, it generates what's called a promise. So on a successful read, this readFileAsync will go on to the then, which will run JSON.parse, and the output of that will be the result of p.then, which is at the bottom. So at the bottom, p.then has a JSON-parsed version of whatever is in the file. However, the way promises work, and this is the whole point, is that the catch underneath is where all the error paths go. Basically this means that either the JSON parse failed or, more likely, the file was not on disk. What this then does is return a new promise, and the new promise does a load from the database, and the return resolve thing just resolves itself into the results of doing this from the database. It also asynchronously writes the file back to disk. I've already got the result, so I don't care whether that happens before I continue. So this whole promise mechanism allows you to take all the callback hell, the multiple branching, and do it all in one nice long stream.

So you don't really do caching. You're expecting that the file is there, and if it's not, you get it from the database. But I've said that you can do
database caching. So if I request from the database cache, I'm kind of assuming that it's going to be cached, and then failing over. So you can kind of create boxes that are cached. But getting my head around promises in general is something which I haven't yet completely done. It's a good thing, though, because callback hell is a Node symptom, right?

So you're going to have problems if you've got... What do you mean? That is, if you return a resolve with an empty array, that will bite you.

Maybe, maybe, depending on what the load entity call makes of it. But I've done this kind of thing: if my database doesn't know what it's doing here, I'll give you back a sensible result.

But then later down the track, when you thought the database would never return an error, or that you understood all the cases in which it would, you'll have some horrible thing where, like, you've got too many file descriptors open, and the database thing will catch it, but you're completely ignoring it. So there is some horrible problem here and you don't know about it.

Well, I'll know, because I haven't got any results back from the database.

Sure, but you probably wouldn't look here first.

No, no, no. Why my database? No, absolutely. I think I had that actually working this morning, but I took it out for the purposes of, like, squeezing it down.

Martin, you said that you want to do something in the style of Watson. So why are you not relying on something from Watson for free? Why reinvent the wheel when these guys have done a lot? [inaudible] Just briefly.

Well, one, I would be out of a job. But also, two, the company wants to do this themselves. They actually believe that they have IP internally, and rather than get the Watson team, or get a consultant to deliver the solution externally, up to speed, for the same price they could have me and own it and then sell it. So that's their thinking. Yes, IBM Watson is a great thing, but IBM's not going to give it away for free forever. I don't read them as saying that.
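Returning to the caching question from the Q&A, the read-from-disk, fall-back-to-database pattern can be sketched with Node's built-in promise-based `fs` API. This is not the slide's code: the talk used Bluebird's wrapped `fs` methods, `loadFromDatabase` is a hypothetical stand-in for the real query, and the error filter is an addition that responds to the audience's point about swallowing unexpected failures such as running out of file descriptors.

```javascript
const fs = require("fs").promises;
const os = require("os");
const path = require("path");

// Hypothetical stand-in for the slow database query.
function loadFromDatabase() {
  return Promise.resolve([{ id: 1, name: "Example Entity" }]);
}

// Try the cache file first; on a cache miss, load from the database,
// hand the result back immediately, and refresh the cache in the background.
function loadEntities(cacheFile, loader = loadFromDatabase) {
  return fs
    .readFile(cacheFile, "utf8")
    .then(JSON.parse)
    .catch((err) => {
      // Only a missing file or unparseable JSON counts as a cache miss.
      // Anything else (for example too many open files) is re-thrown.
      if (err.code !== "ENOENT" && !(err instanceof SyntaxError)) throw err;
      return loader().then((rows) => {
        // Fire-and-forget write-back: the caller does not wait for it,
        // but a failed write is at least logged rather than ignored.
        fs.writeFile(cacheFile, JSON.stringify(rows)).catch((writeErr) =>
          console.error("cache write failed:", writeErr.message)
        );
        return rows;
      });
    });
}

const cacheFile = path.join(os.tmpdir(), "entities-cache-example.json");
loadEntities(cacheFile).then((rows) => console.log(rows.length, "rows loaded"));
```

The first call misses the cache and goes to the loader; later calls read the file instead. Narrowing the catch keeps the single-stream shape of the promise chain while leaving genuinely unexpected errors visible to the caller.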