Thank you. So this title is a bit of a lie. I was just filling out the form on a Thursday afternoon and, you know, I make things up. It would be better called "Some ideas I had about how you could use machine learning to discover things about digital collections", really. And, yeah, funding.

So using machine learning, it will be easy to find soft associations, ones that you can't be sure about, across all of your items, connecting them together or not. For example, this is one thing I've done for Te Māngai Pāho. Te Māngai Pāho fund Māori radio stations to speak Māori, and if the stations don't, they need to cut the funding off. At the moment they pay people to listen to the radio, and I did this pilot where we made a machine that listens to the radio and says what language they're speaking. I'll do a demo. When it's down there, that's Māori; English is over that side. It gets a bit confused sometimes.

(Demo audio plays, partly unintelligible: "...the Māori Party branch annual general meeting is being held on the 10th of September ... it's being held at 38 Richmond Street in Maraenoa. You can contact ...")

Anyway, so it actually works at 1,500 times real speed. That was at real time so you can see it, but it could go through 1,500 hours of audio in one hour and just say, probabilistically, what language is being spoken or whether it's music. So that's for radio funding, but it could easily be used for all of your archives: if you've got an audio collection or a video collection, you could just run through it and have a go at saying what language is being spoken. And because it actually distinguishes noise from speech, it can just pick out whether there are people speaking at all. That's easy. And then, oh yeah, "does it contain...": this is actually an algorithm Google has done. And then there's another example I was going to
talk about, which is looking at Papers Past and finding connections amongst the text there. But actually something came up. Now, has everyone heard of this book? I mean, has anyone not? Any Australians here? OK. Actually, also, if you're live-tweeting this, just stop for a little bit and do something else with your fingers. For the world, I mean, it doesn't really matter. Oh, you can be vague.

So this book, Dirty Politics, among other things talks about this popular blog, or influential blog, which was publishing posts that were written by PR people who were paid by various industries, like tobacco and sugar and alcohol, and by disgraced financiers. They were getting their posts put in this blog to appear as if they were from the blogger, when in fact they were paid posts written by PR people. And we know this because of the emails that got hacked. This one is from the PR person, Carrick Graham, to the blogger, handing over a blog post that the next day would appear on the blog. So for a lot of them there's no doubt. But the emails don't cover everything, so there's a whole lot of posts on the blog that are quite likely written by Carrick Graham or other people, but there's no evidence that they were written by Carrick Graham or other people.

So it's the same kind of thing that I was thinking about when I filled in that form. I was thinking about doing this on Papers Past, but then this turned up and I thought, well, I'll just set it on this and it'll be more useful. So Carrick Graham, for instance, he wrote this post that starts like... oh yeah, I already explained that one. He wrote a post that starts like this, slandering a health researcher, because that's good for the product. And then through another channel, you know, he wouldn't be slandering the health researcher, he'd just be taking advantage of that slander. Anyway, so now if you analyze this, if you just look at the words, or
the text as is, the overwhelming features are the content: the names and, you know, the fact that it's talking about alcohol. So to analyze that, the first thing I did was look at the parts of speech, using a ready-made part-of-speech tagger. A part of speech just means whether it's a noun or a verb, that kind of thing. And it makes a few mistakes: these red ones down here, it's got the wrong thing. There we go. So it thought that "steel" was a verb, and "what" was a noun or a verb, things like that. But overall it gets it pretty good. And then if you look at the flow of the parts of speech, that is characteristic of a writing style.

Or else you can do it like this: you replace the content words, which are mainly the verbs (some of the verbs; like, "is" is a verb, so not every verb) and the nouns, and especially the proper nouns. You replace them with these little tokens, which are Armenian characters; there are strange technical reasons why they are Armenian characters. And then you look at the flow of this text. You can read that much, or you can read pages and pages of it, and not know what they're talking about. So when you're looking at this, you're not classifying it as a blog about alcohol or anything, or even as attacking somebody, though you can sort of maybe see that.

So if you classify them using this, it's safer than if you classify them using the raw text, which is going to be a circular argument. If you're thinking that Carrick Graham wrote all the posts about alcohol, and then you find that all the ones that have the word "alcohol" are by Carrick Graham, by looking at the word "alcohol", then that's just a circular argument and you're proving nothing. So I look at little sequences: I just count up the sequences, the n-grams, of these things. And then this is looking at them with a visualization called t-SNE. The yellow
triangles are Carrick Graham posts, the purple triangles pointing that way, there. Right, I'll see if this pointer works. Right, so here's the key down here. There's a cluster of Carrick Graham posts. Now, Simon Lusk is another of these characters who was writing things secretly. His ones don't cluster so well in this visualization: there's a few there and a few there, but then they're all over the place. And Carrick Graham has got some over there as well. For some of them, it could be that the information I've got that says he wrote them is actually wrong. I'm not entirely sure about where it comes from. I mean, I know where I got it from, but some of them are definite and some of them are maybe not so definite.

Anyway, this is Cathy Odgers. She actually signed her posts, so that's kind of like a control: if her ones cluster together, it's showing that this thing works, sort of. And most of the gray ones are Cameron Slater's. And then he's got these, like, sidekicks. This one up here, there's a daily trivia section, and they all just end up there. And if I cut out any post with "trivia" in the title, it's just the same without that.

So yeah, if you look at these ones now, underneath those triangles there's a whole lot of these gray posts which end up being clustered in the same place using this algorithm, and so they are suspicious. And this is the look with labels on them. Now, some of them, like that Fiji one, that is something Cameron Slater cares about. He writes about Fiji, and that is almost certainly him; nobody else cares so much. And this one, 2005, that is before this thing started going on. But now, "sucking on the taxpayers' teat" and "RTDs targeting kids", that's the kind of thing that Cameron Slater wrote about, and there's a lot of
these ones, I don't know if you can see them, about the BSC. The BSC, Building Services... I can't remember what the C stands for, was an industry group, a cleaning group. There's a word for cartel that's not "cartel". Anyway, they promised to pay the workers, and they got some kind of deal with the Labour government where you had to be a member of this to get work in a government building. And then Carrick Graham was hired by somebody to break that law, so he started attacking the BSC and the leader, the president or the chairman or whatever of the group ("council" is probably the word), Patrick Lee-Lo. And there's this long series of posts just attacking Patrick Lee-Lo for no reason at all. You know, he did nothing; Cameron Slater didn't know anything about him. It just came out of the blue, and it's because Carrick Graham wanted the law changed. Because some cleaning company wanted the law changed, they hired him; he attacked this way and then through other channels lobbied the government, and they got the law changed, and so that series of things finished. So yeah, you get the idea.

And that's what you could do with Papers Past, where it doesn't matter so much but you've got all the time in the world to just have things running over it. Almost all the letters to the editor were anonymous, and all the columns, all the articles were anonymous, pretty much. This is from 1913, you know, so there's a big general strike, and here's someone purporting to be a worker disgusted with the strike. But maybe, if you look at their writing style, you might find that they actually wrote the article over here, or that they're a member of parliament or something like that. You wouldn't be able to prove anything, but you would get, like, a web of hints, and it's kind of pretty much for free. How much time do I have? I feel
like I'm going quick. Yeah, I'll do some questions. I'll do the other algorithms. So this t-SNE, this is something I was meaning to say: it's a good visualization algorithm, but it's not really a clustering algorithm. It's not the best for actually finding the things that were likely to be done by the same person. There are other things I've tried, but they don't look as pretty. And I don't think there's anything that will stand up, or that anything will come out of it. So it's not really very relevant, except that it gives a list of articles for people who are looking into things to look at. So yeah, I don't remember what I was thinking of; sorry for the false pretenses. But the recurrent neural networks that were doing the sound thing, which can be used for doing all kinds of stuff with text, that's a project I wrote myself, and then for looking at the other things I've been using scikit-learn, which is just a Python library. So that covers that one. Yeah, there we go. So does anybody have any questions?

We need to use the mic, for recording purposes.

Thank you, that was a really interesting presentation. I was just curious: to identify somebody's writing style, how much text do you have to give your software for it to learn?

The more the better. Because each post is quite short, there's just not that much evidence in each one to make a definite call. But if I took all the books by Dickens and all the books by Robert Louis Stevenson, and took each chapter, say, and shuffled them all up, and treated them in this way, which ignores the topic so you wouldn't know where the pirates were, it would be able to sort them out pretty definitely, because they're long, you know, a thousand words. But these are 300 words.

Do we have to convert anything? Let's say I have those, like Dickens, in PDF
files. Do I need to convert the file format to something?

No, no, it just uses text.

Thank you.

But I did that pre-processing, where I was putting in Armenian characters for the nouns and stuff. So it does that stuff itself.

Great talk, thank you Douglas, I enjoyed it immensely. I have a follow-up question to that one. Looking at Papers Past, for example: could you get it to learn against itself? So each one becomes a control instance, and it would check for similarities against every other one, because we obviously don't know how many authors there are, or any of that stuff. So can you make it self-controlling?

Yeah, yeah, sort of. Like, in that visualization there were clusters; if I turned off the labels so they're all just dots, then you would still see that there were clusters where some of the names clustered.

I was just curious about why you say it wouldn't stand up in court. I mean, I don't know the New Zealand system very well. I'm from New Zealand but I work in Australia, and I've just moved to Sydney from the Australian National University, where we've got quite a large forensic linguistics department. Our linguists consult with the police all the time, and they're always being called as expert witnesses for author identification and that sort of thing. So I don't understand: is it your particular algorithm that wouldn't stand up in court, or is it that the New Zealand system is different?

It's a combination of possibly just my algorithm and, I think, there's just not very much evidence in each post, because they're all trying to write in the same style. That differs from most of the literature on this thing as well. One thing I read about was the Turks and the Armenians arguing over who started it, and because they're all talking on the same topic, they don't need to ignore the topic. And because they're not trying to pretend to be each other, they're just trying to be anonymous, then they've got
more freedom, and they let themselves go. But Carrick Graham is trying to sound like Cameron Slater. And I'm not actually a forensic linguist, so I wouldn't stand up in court very well.

But it seems, I mean from the literature at least, that the n-gram approach to stylistics is one of the most accepted approaches. So unless your algorithm is doing something particularly weird, I think it's one that would be appropriate for this sort of consultancy work, if a lawyer or someone was interested in following up on that.

Yeah, and in fact the recurrent neural network is much better than the n-grams, but no one will believe that.

Yes, yeah, I really enjoyed that; it was great to see the outcomes of that. I just got curious in terms of Papers Past: there's a lot of OCR errors in the text. Would that matter to your approach? Would it still find matches even though there's a certain amount of noise?

It depends, I think, on how even the noise is. If everyone's affected the same, roughly, it'll be fine; noise is good. If some of the newspapers, or say one edition, is really dirty and it's just got lots of noise, then they'll probably all cluster together, you know, because they would look like garbage. So it would kind of look after itself in a way, because the ones that look like garbage will just get thrown in the garbage pile.

Is there any particular data set you'd like to run this against?

No, no.

OK.

Yeah, I'd quite like to just explore and see what's possible, but there's no money in it, so I'm not really doing much.

That's good. Douglas, another round of applause.
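
The masking and n-gram counting described in the talk can be sketched roughly as follows. This is only an illustration under stated assumptions, not the speaker's code: the talk used a ready-made part-of-speech tagger, Armenian characters as mask tokens, scikit-learn and t-SNE, whereas this sketch substitutes a small hand-picked function-word list (the hypothetical `FUNCTION_WORDS`), a single mask token `X`, and plain cosine similarity between trigram counts. All function names here are made up for the example.

```python
from collections import Counter
import math
import re

# Assumption: a tiny function-word list stands in for the part-of-speech
# tagger mentioned in the talk. Anything not listed counts as a content word.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "if", "of", "in", "on", "at",
    "to", "for", "with", "by", "is", "are", "was", "were", "be", "it",
    "this", "that", "he", "she", "they", "we", "you", "i", "not", "so",
}
MASK = "X"  # the talk used Armenian characters; one placeholder is enough here


def mask_content_words(text):
    """Lowercase, tokenize, and replace every non-function word with MASK."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t if t in FUNCTION_WORDS else MASK for t in tokens]


def ngram_counts(tokens, n=3):
    """Count overlapping n-grams of the masked token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(text_a, text_b, n=3):
    """Compare two texts by the flow of their function words only."""
    counts_a = ngram_counts(mask_content_words(text_a), n)
    counts_b = ngram_counts(mask_content_words(text_b), n)
    return cosine_similarity(counts_a, counts_b)


if __name__ == "__main__":
    # Different topics, same sentence skeleton: the topic words are masked
    # out, so the comparison cannot key on words like "alcohol".
    print(mask_content_words("The sugar is in the bowl"))
    print(style_similarity("The sugar is in the bowl",
                           "The whisky is in the glass"))
```

Because the topic words never reach the counting step, a match between two posts cannot come from both mentioning alcohol, which is the circularity the talk warns about. With posts of only a few hundred words the counts are sparse, so any score from a sketch like this is a hint rather than evidence.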