Hello everybody. I'm Ash Henson from SOAS, I'm a postdoc. Today I'm going to talk about using network theory, which James called graph theory (same thing, different name), to detect rhyme communities in Han dynasty Chinese.
Okay, so just to give an introduction and the structure of the talk: I'll introduce the problem, go a little bit over previous studies, and then talk a little bit about Bailey's method, which is from Julian, sitting over here. Then I'll talk about applying his method to the Han dynasty corpus and a little bit about the results.
I personally fell in love with Chinese characters. I have two degrees in engineering and I switched to doing linguistics because of this, so I always like to show some of the beauty of the characters, because for me it's very motivating. I actually drew that myself; it was for a paper I had to do on the Laozi.
Problem introduction. Basically there's been a lot of work done on Middle Chinese and a lot of work done on Old Chinese, and Han dynasty Chinese falls in between these. But it's not just that time-wise it falls in between; it's actually a crux of change between Old Chinese and Middle Chinese. So why the Han dynasty? Well, we see here Old Chinese, which I just said there's been a lot of work done on, and we see Middle Chinese, and then there's this big blank spot in the middle. Actually, depending on your definition of Old Chinese, some scholars include the Han dynasty as part of Old Chinese. We go by basically the Baxter definition, so it doesn't include Han Chinese. The Han falls here, right in the middle of these other two, and it's interesting to note, too, that though there's still also a blank spot here, the amount of sound change that happens here is larger than what happens in this later period.
Okay, so Baxter, in his 1992 book, lists out a bunch of sound changes that happened between Old Chinese and Middle Chinese. This is just a small list, and these are the actual sound changes that are thought to have happened during the Han dynasty. There's another, I believe, 18 changes which may or may not have happened during the Han dynasty. And so these are some of the questions that we would like to answer with this project.
And the previous studies. Well, there was a Qing dynasty scholar that did some stuff on the Han dynasty, but in modern times you have Bodman's Shiming study. And Luo, from Luo and Zhou, actually suggested to Bodman to do that study, probably because he anticipated they were going to study the finals while he had Bodman study the initials. These studies are actually still quoted when people do Han dynasty stuff today. After this you have Coblin, who did the sound gloss study back in 1983, and then Schuessler's Later Han Chinese reconstructions from 2009.
So, a common issue that all these studies have is that they don't take the Han data set in its own right. They kind of interpret it in terms of somebody's Old Chinese reconstruction. They basically say, well, you have Old Chinese and you have Middle Chinese, and this Han fits here in the middle. The problem with doing it that way is that all the Old Chinese reconstructions have tons of assumptions, and when you do that you're bringing these assumptions along with you into the Han. And you're not looking at the data in its own right, so you're not able to disprove things about Old Chinese that might not be true, or about Middle Chinese. And then there's also the question of whether Han dynasty Chinese is a direct descendant of Old Chinese, and that may or may not be the case, because there are multiple dialects going on. And then they also tend to treat Han as slightly modified Middle Chinese.
And that's also a huge assumption, given the time distance that I showed earlier.
So, as James mentioned graph theory, I'm calling it network theory. Our study is based on using network theory. If you guys use Facebook and it says, hey, do you know this person, that's a case of your interaction with network theory. Facebook has an algorithm which treats people as nodes, and then the relationships between people are edges. So a network is just a way of representing relationships between objects. In Facebook's case it's people; in our case we're talking about rhyme words in Han dynasty texts.
This graph here is showing flights out of Atlanta, Georgia. The nodes that are bigger just mean there are more flights going on between those. For the smaller nodes, there are not as many flights going to these types of nodes. And obviously the big nodes are going to be the big cities, right. So this is just using network theory and applying it to flights.
Okay, so Julian. He wrote a paper a few years back about applying network theory to Tang dynasty and Song dynasty poems, and he used this to create an annotator which will go in and automatically annotate. This is a big problem: his study had 250,000 poems, a lot of which have not been annotated, and you can imagine, if you want to go in and annotate 250,000 poems, you're talking about many years of time, even if you have a large group of people doing it, and then they might be making all kinds of mistakes. There are just all kinds of problems that can happen. So he came up with a way of automatically annotating these poems. The way he did that was, you would assume a rhyming structure, which is basically that the last character of every other line rhymes. And then he would use this to annotate.
Well, the naive annotator would assume that any characters in rhyming position in the same stanza of the same poem rhyme. Obviously that's a very naive assumption; poems don't always turn out that way. But then he also has other annotators, and he would test the annotators against each other. So you have the Middle Chinese annotator, which is based on the Guangyun. Because we know what words rhymed in the Guangyun, you apply this to Tang and Song poetry.
And then, using the network theory stuff, you also have community detection algorithms, which James also talked about. These community detection algorithms basically just look at the nodes, and which nodes connect to which other nodes, and how often they connect, and these kinds of things, and then they come up with communities. The algorithm says, okay, I think these guys rhyme and I think these guys rhyme.
As far as data, like I just said, he had 250,000 poems, which is millions of lines of poetry. Now, we're not in as good a position for doing the Han dynasty, because the amount of data is just much smaller. For our project we also have a naive annotator and we also have a community annotator. A point where we differ is that we use Schuessler's 2009 Later Han reconstructions where he used the Guangyun, because there was no rhyme book from the Han dynasty that we can use, so we had to use something else.
And then here's the amount of data, which is obviously way less. We have received poetry, about 5.4 thousand lines. Though I will say this doesn't represent the end; there's still other poetry we can bring in, we just haven't got around to it yet. We have mirrors, which have 44,000 lines, but the problem with the mirrors (I say problem; it's good data) is that it's very formulaic. There'll be, like, two characters that just rhyme hundreds of times, right.
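The naive annotator and the network it feeds can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's code: the function names, the toy poem format (a poem is a list of stanzas, a stanza is a list of line strings), and the reading of "every other line" as the 2nd, 4th, ... lines are all hypothetical. Latin letters stand in for characters.

```python
from collections import Counter
from itertools import combinations

def rhyme_words(stanza, every_other_line=True):
    """Return the last character of each line in rhyming position.

    Assumption: "every other line" means the 2nd, 4th, ... lines.
    """
    lines = stanza[1::2] if every_other_line else stanza
    return [line[-1] for line in lines if line]

def naive_edges(poems, every_other_line=True):
    """Count how often two characters share a stanza in rhyming position."""
    edges = Counter()
    for poem in poems:
        for stanza in poem:
            words = sorted(set(rhyme_words(stanza, every_other_line)))
            for a, b in combinations(words, 2):
                # Nodes are characters; the edge weight is the co-occurrence count.
                edges[(a, b)] += 1
    return edges

# Toy data (hypothetical, not from the corpus).
poems = [
    [["xa", "xB", "xc", "xD"]],  # one stanza; rhyme words B and D
    [["ya", "yD", "yc", "yB"]],  # the same pair again, as in formulaic text
]
edges = naive_edges(poems)
print(edges)
```

In this sketch a pair that rhymes over and over only raises the weight of a single edge; it adds no new nodes or edges, which is one way to see why highly formulaic text contributes less unique information than its line count suggests.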
And so what that essentially means is, we don't have 44,000 lines of unique data. It's actually much smaller than what it looks like. And then we have 873 lines of stele data, and we also have bronze data and bamboo data, which will be added later; we just haven't added it yet.
So I'm going to talk about the different types of annotators but show real data at the same time, so we can save time in the presentation. This is a poem called the Unnamed Song, by Sima Xiang. Like I said before, the naive annotator just says any character that's in rhyming position rhymes. So in this case it's Bay and Shui. And the A, I can actually use the pointer, is just a marker. It's just saying that any two characters that are marked A in the same stanza rhyme; that's essentially what it means. There's no meaning to using A; it could be any symbol.
And here we are with the Schuessler annotator. In addition to having a rhyme marker like A and B, we also have his reconstructions in there. So you see that they both end in -i, but the main vowels are different. The Schuessler annotator is much more strict. The naive annotator is not strict at all: anything in rhyme position rhymes. With Schuessler, no, it has to be exactly the same rhyme. So you see here Schuessler says no, these characters do not rhyme.
And then we move on to the community annotator. The community annotator, to repeat once again, is based upon the nodes and edges and these algorithms going in and figuring out which groups rhyme, and according to the algorithms, these characters do rhyme.
Now to move on to a bit more interesting data, because two rhyming characters isn't all that interesting. We have this chinga our show. So once again we have the naive annotator. And once again, nothing new, nothing unexpected.
It's all A, they all rhyme, but these are actually two different groups. This space here represents a split between stanzas, and we have another stanza. So even though they're all marked A, these A's aren't guaranteed to rhyme with these A's.
And then we look at Schuessler, and I'm kind of losing the bottom of the slide; there's another one down here. I can't remember for sure, and I don't know how to show the bottom, but I think it doesn't rhyme. I think it also ends in -i but there's a different main vowel. But then we look at the community annotator, and the community annotator agrees with the naive annotator that these all rhyme and these all rhyme.
So, as I mentioned earlier, Julian in his paper also looked at these different cases, where all three agree, and then where two agree and one disagrees, and which one is the odd man out. So we'll look at some of these cases.
But before that I'll talk about the kind of data we're using. As I've already mentioned, we're looking at received poetry, stele, and mirror data. Our received poetry comes from this book by Lu Qinli, where he collects poetry from the pre-Qin, Han, Wei, Jin, and Northern and Southern dynasties, and we're only using the Han data from that. The stele are also from the Han and the Wei, and we're also just using the Han part of that, and the same for the mirrors. The mirrors are this huge collection that this Japanese guy [name unclear] collected over many, many, many years, and it's kind of an interesting story. I won't go into it, but he basically was donating a bunch of stuff to a museum, and they discovered all this data that he had, and they were like, oh my God, we need to publish this. So they ended up publishing all this guy's data, which he didn't publish himself for some reason.
Now, we used to call this graph the spaceships.
I like this graph because it shows kind of how this works. In here, a node is actually a kind of group of characters rather than just a single character. And you can see that these three, without even knowing what the reconstructions are, are all open syllables even in Old Chinese, and there's a good chance they could rhyme. But then this one definitely isn't going to rhyme with these. And then if you look at these, like, who, who, who, who, they all rhyme in Mandarin, which obviously doesn't mean they rhyme in earlier versions of Chinese.
At any rate, it's kind of a sanity check, I like to say. This is something in engineering we do all the time to make sure we're not going off the deep end. So this was my sanity check, where I would check cluster one, and 74 of these characters are in the same rhyme group, and then these five are the same, and these 20 are the same. So you can see that we haven't converged on it yet, but this is very likely due to the amount of data we put through. At least the algorithms are getting 80% correct, and then 92 in this group, and 91 in this group down here. So there is some convergence going on. It hasn't completed, but as we add more data to it, they should converge to a higher degree than we have already.
So now I'm going to show some of the graph aspects, and this looks like a mess at first glance. But even in this mess, the very interesting thing is that if you look around the edges, you can start seeing little groups that form around these edges. This first data set is the naive annotator, so this is the one that assumes all characters in rhyming position rhyme. And this is run over the combined data, so all three data sets.
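The two steps described here, finding communities in the weighted rhyme network and then sanity-checking each cluster against a reference rhyme grouping, can be sketched as follows. The talk does not say which community detection algorithms were used, so this sketch swaps in a simple deterministic label propagation purely as a stand-in; the function names, the node names `w1` to `w6`, the edge weights, and the reference groups `X` and `Y` are all hypothetical.

```python
from collections import Counter, defaultdict

def detect_communities(edges, max_rounds=20):
    """Group nodes by weighted label propagation (a stand-in algorithm).

    edges maps (node_a, node_b) to a co-occurrence weight.
    """
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            score = Counter()
            for other, weight in neighbours[node].items():
                score[labels[other]] += weight
            top = max(score.values())
            candidates = sorted(lab for lab, s in score.items() if s == top)
            # Keep the current label on a tie, otherwise take the smallest.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    groups = defaultdict(list)
    for node, label in labels.items():
        groups[label].append(node)
    return sorted(sorted(members) for members in groups.values())

def purity(community, reference):
    """Share of a community that falls in its most common reference rhyme group."""
    counts = Counter(reference[word] for word in community if word in reference)
    return max(counts.values()) / sum(counts.values()) if counts else 0.0

# Toy network: two tight triangles joined by one weak edge.
edges = {
    ("w1", "w2"): 3, ("w1", "w3"): 3, ("w2", "w3"): 3,
    ("w4", "w5"): 3, ("w4", "w6"): 3, ("w5", "w6"): 3,
    ("w3", "w4"): 1,
}
communities = detect_communities(edges)
reference = {"w1": "X", "w2": "X", "w3": "X", "w4": "X", "w5": "Y"}
print(communities, purity(["w1", "w2", "w3", "w4", "w5"], reference))
```

A purity figure like this is only as good as the reference grouping it is checked against, so a low value can point either at the clustering or at the reference.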
And so what we did is we ran that, and then we ran these algorithms and figured out what communities the algorithms thought there were, and then showed the same graph, but the color groups are the graphs that are supposed to be rhyming. I don't know if you can see, my characters are kind of small, but you can see there are clusters forming, the colors being the clusters. And then here is the same graph; I've just circled some of the clusters where the graphs are thought to be rhyming by the community detection algorithms.
And then here's the Schuessler data for the same set. Now, the Schuessler data is obviously much more ordered, but this is completely to be expected, given the fact that it will only let in something it thinks is 100% the same rhyme. And then there's a bunch of singletons around, which James does not like, we found out in the previous presentation. These singletons come from cases like the one we actually saw, where there were two characters that both ended in -i but had different main vowels, and the community detectors thought they rhymed. So they would appear in the same group there, but in the Schuessler diagram, if they don't appear in other rhymes, they're just kind of stuck there by themselves. So these singletons are due to how strict the Schuessler annotator is.
And here's a close-up of some of the Schuessler groups, and I got his reconstructions printed out, so you can see that all those characters indeed did rhyme. One thing that was very interesting is up here at the top you have John. Now, I saw that and I thought, oh, that doesn't fit. And then I thought, oh wait a minute, paleographically it gets mixed up with Dean all the time. So I went and looked it up in the Guangyun, and indeed in the Guangyun it only has an -ng ending. It doesn't have an -n ending.
So Mandarin led me astray there, to think that it wasn't part of that group, and actually it was.
Now I'm getting into the case studies, and this is modeled on Julian's paper. I'm once again looking at cases where there's some combination of agreement and disagreement between the three annotators, where N is for naive, S is for Schuessler, and C is for community.
This is a poem which I've translated as Ancient Wuzatsu Poem. Wuzatsu is just a type of poetry, and it wouldn't make sense to translate it into English, so I didn't. At any rate, this is a case where all the annotators agree. I don't have the naive annotator listed, because we all know it's just going to be a big column of A's. So Schuessler agrees, and here's Schuessler's reconstruction. And then the community annotator also agrees.
One point that's kind of interesting, and it's something that we're hoping the project will answer, is: did tones matter in Chinese poetry? The first time tones are mentioned in Chinese literature is after the Han dynasty; I think the 400s to 500s time frame is the first time we know of that Chinese tones are mentioned. And the reason people think tones were not mentioned until that time period is that the tones hadn't actually completely become tones at that point. So the fact that these are all the same tone is kind of interesting, although Schuessler himself would say, maybe that's not a tone, maybe that just has the ending that would later become a tone.
Now here's a case where the naive and the community annotators agree but they disagree with Schuessler. And it has this interesting rhyme structure of, well, if you're looking at Schuessler, A A B C, A B A. I was just seeing that they're both with this a, but then I was like, why are they different? It's because that u is not a medial u, it's a ua diphthong. So the Schuessler annotator will say it's different.
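Because the rhyme letters are arbitrary markers, comparing annotators means comparing how they group the rhyme positions, not whether they happen to use the same letter. A minimal sketch of sorting a poem into the agreement cases above might look like this; the function names and the toy label strings are hypothetical, and `None` is assumed to mark a position an annotator leaves unrhymed.

```python
from itertools import combinations

def rhyming_pairs(labels):
    """Set of position pairs that an annotation marks as rhyming together."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] is not None and labels[i] == labels[j]
    }

def agreement(naive, schuessler, community):
    """Classify one poem by which of the three annotators group it alike."""
    n, s, c = (rhyming_pairs(x) for x in (naive, schuessler, community))
    if n == s == c:
        return "all agree"
    if n == c:
        return "N and C agree, S differs"
    if s == c:
        return "S and C agree, N differs"
    if n == s:
        return "N and S agree, C differs"
    return "none agree"

# Different letters, same grouping: this still counts as agreement.
print(agreement("AAAA", "BBBB", "CCCC"))
# Schuessler splits the four rhyme words into two groups.
print(agreement("AAAA", "AABB", "AAAA"))
```

Comparing pair sets rather than letters means an A in one annotator's column is never mistaken for an A in another's.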
If you look at the Old Chinese and Middle Chinese for these, you also see that you have this, basically, A, A, A, B, B, A, B, A structure to it. And it's kind of interesting to think, what does this mean? Does this mean that the Schuessler annotator is correct, or that the Schuessler annotator is too close to Old Chinese? I actually don't know the answer. It's something I would like to look into and understand better. But even in Middle Chinese you see that it's actually A, A, B, B, A, C, A. So there's a kind of weirdness with this one character. It's not aligned for Old Chinese, but I'm just looking at the A's; Schuessler divides up his A's. And this actually comes up in these differences here. In fact, here we go.
So this is a case where Schuessler agrees with the community annotator, but they both disagree with the naive annotator. Well, obviously, because there's a bunch of different rhymes here. These letters don't mean anything in themselves, so the A here for Schuessler does not equal the A here. They just divide up the groups differently. But in this case, the odd man out seems to be this o here with u. Schuessler says that this A and this A, or this A, are different. Right, but we don't know when the split happened. The split could have happened before the Han, or after the Han, or sometime in the Han. And if that split had not happened, this thing would look way more like the naive annotator than it would otherwise, right. So this is another question: in Schuessler's splitting of these A's, is he correct in saying that Later Han had this? We don't know the answer to that, but this data will probably end up showing us.
And then this is a case where none of them agree. So it's just a big mess but even if you look at it.
The odd one is once again here in the middle, with -in, because having an -n ending it wouldn't rhyme with anything. So once again, why is it like that? It could be that this character had an open syllable reading that didn't end in -n. There's a theory that in Old Chinese you have an -r ending, and -r went to -n and also to -j, so it could be an open syllable or it could be this closed syllable. And it could be the case that this character had an open syllable reading, and it would end in -i, right, because -j would go to -i. It actually seems likely that that's what's going on. Or it could also be that a Han dynasty poet saw an Old Chinese poet rhyme these words and decided it was okay, even if it didn't make sense in his own dialect.
So we started out just using his, but, I mean, the obvious question that you're not asking here is: is the Han dynasty rhyming structure the same as later? It probably isn't exactly the same, I would think, yes. But at this point, for the received poetry we are using the same annotator. For the mirrors and the stele, every single line rhymes. And this comes from the fact that we initially started doing every other line, but if you look at the data it's pretty obvious that mostly every line is rhyming. Changing the annotator is something that would come later, if at all, because you can imagine, even though I'm saying it's a small amount of data, there are still thousands and thousands of poems, right. So to go in and actually annotate these by hand would just be a huge project. It's not really feasible for the current project.
Okay, if I look at this, to me, I would say it's clear the poet wanted these to rhyme. So there's this big question of what we are annotating for. Are we annotating for, in a reconstruction of the language, these are not the same vowel, or are we annotating for the poet's intention?
For me it brings into question the differentiating of these two A's, at least for this particular column. I mean, I don't know if in general that's true, but I agree, clearly this was meant to