That's on Sunday at about half past noon, assuming I get the slides written, so that's the advert. But for the other end of the day, we're going to come from the other direction and look at some social media analytics. There's going to be a tiny bit of maths, but I guess we haven't all been to the bar yet, so this is the time to get the maths in. So, assumption the first: your various social media networks are a little bit tribal. There will be people you know from work, people you know from various social contexts, et cetera, et cetera. So we're going to represent your Twitter network as a graph, and because we're all a little bit tribal, there are going to be nodes on that graph that are more densely connected to each other than to everybody else. Assum... I wish I had a nice big pointed stick, but never mind. Assumption the second: if you start somewhere on that graph and take a random walk around it, you are more likely to stay within your given tribe, the cluster in which you started. So, for instance, teenage hackers: are they more tribal, with more segregated follower networks, than your middle-aged Python dev? We shall see.
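As a concrete sketch of that setup, here is a made-up toy "Twitter" graph with two loose tribes, built with NetworkX (all the names and edges are invented for illustration; the real data, described later, was pulled from Twitter):

```python
import networkx as nx
import numpy as np

# A toy follow graph with two loose tribes; names and edges are made up.
# Follow relationships are treated as undirected links for clustering.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),   # work tribe
    ("dave", "erin"), ("erin", "frank"), ("frank", "dave"),   # hacker tribe
    ("carol", "dave"),                                        # one bridge edge
])

# The adjacency matrix is the starting point for the random-walk machinery.
A = nx.to_numpy_array(G, nodelist=sorted(G))
print(A.shape)   # (6, 6)
```

Within each tribe every pair is connected; between tribes there is a single bridge, which is exactly the "more densely connected to each other than to everybody else" structure the clustering is meant to find.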
But to try and find these clusters, we're going to use a technique called MCL, the Markov Cluster algorithm. Here comes the maths. We're going to stick with a really trivial network of two nodes: node A has a self-loop, and B just has a link back to A. And we're going to draw a matrix to represent that: one means there's a link, zilch means there isn't, so A links to itself and to B, and B links just back to A. We normalise the columns to get the probabilities of hopping around that network, so the columns sum to one. If you're in A, it's a 50-50 chance you'll stay in A or hop over to B. If you're in B, by definition you must hop back to A; that's all you can do, that's how we've set up our trivial graph. Here comes the science bit. To work out the probability of being in a given node after the next hop, all you have to do, all, going back to high-school maths, is square the matrix: multiply the matrix by itself once. If you started at A, you've got a 50% chance of hopping to B, and from B it's a certainty you'll hop back to A, so the path through B contributes 0.5 × 1 = 50%; staying put at A for both hops contributes another 0.5 × 0.5 = 25%. Add them up, and after two hops you've got a probability of three quarters of being in A. Clear as mud. Deriving the remaining quarter, and the entries for starting at B, is left as an exercise for the viewer. So, take it as read: multiply the matrix by itself and you get the updated probabilities of being in a particular node. Then you cheat, or rather you over-dramatise the probabilities: you square them, this time element-wise. If you square a number that's less than one, which a probability is, it gets smaller, and the smaller it is, the smaller it gets. Renormalise the columns so they sum to one again, and you've stretched the probabilities apart.
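The two steps on the trivial two-node graph work out like this in NumPy (a worked version of the arithmetic above, nothing more):

```python
import numpy as np

# Column-stochastic transition matrix for the trivial two-node graph:
# column 0 is "starting at A" (50/50 stay or hop to B),
# column 1 is "starting at B" (must hop back to A).
M = np.array([[0.5, 1.0],
              [0.5, 0.0]])

# Expansion: squaring the matrix gives the two-hop probabilities.
M2 = M @ M                          # [[0.75, 0.5], [0.25, 0.5]]

# Inflation: element-wise square to over-dramatise the probabilities,
# then renormalise each column so it sums to one again.
inflated = M2 ** 2
inflated /= inflated.sum(axis=0)    # [[0.9, 0.5], [0.1, 0.5]]
```

The first column shows the three-quarters/one-quarter split after expansion being stretched to nine tenths versus one tenth by inflation.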
So after expansion and inflation, column A has gone to nine tenths versus one tenth, and column B is still a 50-50 split. We've over-dramatised: if you started at A, we now say there's a 90% chance you'll be back at A and only a one-in-ten chance you'll be at B after two hops. You keep doing that until things stop changing and you get convergence, and you end up with a load of zeros and ones in your matrix. If you look at a given row and see which entries in that row are ones, those tell you: given that I'm at this particular node now, after convergence, which nodes are credible starting points? If I'm here now, where is it credible I started from? And that set is going to define your cluster. So let's try it out. My learned colleagues, two of them, have about 7,000 once-removed friend-and-follower relationships between them, and we're going to apply the MCL algorithm to that. And we're going to test it. What do you test it against? The popular open-source network-slash-graph analysis program Gephi. Gephi has lots of nice layout options for your graphs, and there's a particular layout called OpenOrd which is supposed to emphasise clustering. Gephi also has its own clustering algorithm, the Louvain method. So we'll cluster with MCL, cluster with Louvain, and see what they both look like in Gephi's clustering-friendly layout. The Louvain method: you probably can't see too well, but all the nodes that are close together are pretty much of one colour, so the layout that emphasises clustering has put nodes from the same cluster together. MCL: oh dear, it looks like an explosion in a paint shop. Oh, it didn't work. Who's convinced by that? Who have I fooled by going: here are two nice graph visualisations, and, well, fair dos, this one's got all the colours in nice neat clusters, so that method must work? Let's have a little sanity check. I ask: why did Louvain do this?
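The whole loop fits in a few lines. Here is a minimal sketch, assuming an inflation power of 2 and a made-up toy graph (two triangles joined by a single bridge edge, with a self-loop on every node, as MCL conventionally adds); it is an illustration of the idea, not a production implementation:

```python
import numpy as np

def mcl(adj, inflation=2.0, max_iter=100, tol=1e-6):
    """Alternate expansion (matrix squaring) and inflation (element-wise
    power, then column renormalisation) until the matrix stops changing."""
    M = adj / adj.sum(axis=0)                # column-normalise to probabilities
    for _ in range(max_iter):
        expanded = M @ M                     # expansion: two-hop probabilities
        inflated = expanded ** inflation     # inflation: over-dramatise
        inflated /= inflated.sum(axis=0)     # renormalise columns
        if np.allclose(M, inflated, atol=tol):
            break
        M = inflated
    return M

def clusters(M, thresh=1e-3):
    """Read clusters off the converged matrix: each non-empty row's
    significant columns are the credible starting points for that node."""
    found = set()
    for row in M:
        members = frozenset(np.flatnonzero(row > thresh))
        if members:
            found.add(members)
    return found

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3,
# plus a self-loop on every node.
A = np.eye(6)
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

print(clusters(mcl(A)))
```

On this toy input the two triangles fall out as the two clusters, which is the behaviour the talk relies on at Twitter scale.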
Because this is Dino Pony, co-owned by my other half, who's probably sitting at the back somewhere. And this is William Bennett. Dino Pony, I think, mostly likes apples. William Bennett likes harsh electronic music and Tabasco. What on earth was Louvain doing, putting them in the same cluster? Something is probably a little bit wrong with the Louvain method. So we asked the researchers to rate the clusterings from both methods, Louvain and MCL: if you look at the list of people and go, I know who these people are, and I can identify the context they've got in common, give it top marks. I guess the other ratings aren't really that relevant. Bear in mind this is about 7,000 next-nearest neighbours, so they don't know very many of the individuals in that extended network. But for one in five of the MCL clusters they'd say, oh, these are people I know from this university, or people I used to hang out with during X; there's a context. Louvain: not one single cluster, all generally pretty random-looking. And the nice thing about MCL is that even I can implement it. Now, why did Gephi's method go so wrong? I don't particularly mean to hate on the Louvain method or Gephi, because I wrote neither of them. The elephant in the room is that MCL took ten minutes to run on my little box, while Gephi produces its clusterings in a matter of 20 seconds. But Louvain is too clever by half. It tries to build up larger clusters from smaller ones, and the degree of Twitter networks is so huge that if you pull in one Twitter account you shouldn't, you pull in all their friends and followers as well; make a mistake early on and there's no way to break that clustering apart, you're stuffed. MCL, on the other hand, is so dumb it isn't even aware of what a cluster is. All it does is update a table of probabilities, and we interpret those as clusters.
I've seen the Louvain method, or the layout-based approach, work quite well for other humanities-type data, but only when the degree of the network is much, much, much less than Twitter's. One good example was a graph of which Egyptian scribes had which customers in a particular town. So, shout-out to the tools: Twython, Celery, Redis and RabbitMQ for acquiring the Twitter data; Neo4j for storing it; NumPy for turning the handle and doing the crunching; Gephi for the pictures; and the nice Python library NetworkX. Any questions, or rotten fruit? That was probably somewhat more than 10 minutes. Tony, silence!