Thank you, everyone, for coming to today's lunchtime lecture. It's my pleasure to introduce our speaker, Julian Bailey. Julian is from France and moved to the United Kingdom to do a master's in NLP at Edinburgh some time ago. More recently, he has also done a master's in psychology at SOAS in London, where it happens that I was his dissertation supervisor, although that may just be because no one else was available. But anyhow, he did a very interesting MA, and this is the research that comes out of it. I'll also mention that you may have seen his name on our homepage at the Trinity Center for Asian Studies, because he is also, in his professional life, an NLP engineer at Microsoft. Together, I mean, it's kind of 99% him, but we are working together on making minority language auto-completion keyboards. So you may have seen him on our homepage in that context as well. But today, it's about Chinese historical phonology, and in particular the use of networks as a tool for the study of rhymes. So with that, I will disappear from your view and let Julian do the talking.
Thanks, Nathan, for the intro. So I don't need to say more about myself. But yeah, as Nathan said, this is more or less a rework of my MA dissertation, and it's a work in progress. There's an article currently under review. So I'm going to make a hopefully brief introduction to the field of reconstruction of Chinese phonology, and I will assume that the audience has enough knowledge about it that I can be brief in this domain. So Chinese characters do not explicitly indicate pronunciation. As we know, it's not an alphabetic or syllabic script. And so when we try to reconstruct older stages of the language, we rely on two different types of information, explicit information and implicit information. In the range of explicit information, we have a variety of tools at our disposal, in no particular order.
Rhyme books, which are books that were written to help poets know which characters rhyme with each other, but without telling how they're pronounced, just saying A rhymes with B and with C, et cetera. Rhyme tables, which provide some explicit phonetic information regarding what the categories of the rhyme books represent. Fanqie spelling, which works like a rebus. I don't know if you can see my pointer. If you cannot, please let me know, because I'll use it also later. But it tells us, hey, this character is pronounced as a sum of these two characters. So for the two spelling characters shown here, we take the first consonant of the first character and the rhyme of the second character, and together they tell us how the character is pronounced. There are also some researchers who make use of loan words into languages for which the script is phonetic. So if we know a word has been borrowed into another language and we know how that word is pronounced in that language, that gives us some information. And finally, last but not least, reflexes of these characters in modern dialects. In the field of implicit information, we have the graphical structure of characters, which tells us that if a character contains a given component, it's likely to be pronounced similarly to other characters that have this component. And finally, which is the topic of our investigation today, rhyme. If two things rhyme, then they are probably pronounced somewhat similarly. We are very fortunate in Chinese studies to have a very wide range of rhyming material, starting almost three thousand years ago with bronze inscriptions and, of course, the Book of Odes. And we have extremely large corpora of recorded poetry from the Tang and the Song, hundreds of thousands of poems, and that represents millions of lines. And of course, with a lot of this material rhyming or supposedly rhyming, that gives us millions of examples of things that do rhyme together.
There is currently no comprehensive annotation of this corpus, due to its size. I do not believe that it can be annotated manually, at least not by a small team of people. And so my research was aimed at finding how we can annotate this corpus. The goal of annotating it is, of course, to provide a body of material where we can, as a collective academic community, say: we know these things rhyme, so what kind of analysis can we derive from it? So annotating this corpus and having it collectively shared is the first step to the analysis. Of course, there have been plenty of people who did analysis themselves. But as far as I'm aware, they have not published the corpus on which they are relying. So this research is trying to address this. A brief bibliography: the main inspiration for this research was the last paper we see here, which is "Using network models to analyze Old Chinese rhyme data" by Johann-Mattis List. And from the rest of the bibliography, you can see that I took a lot of inspiration from the work of Dr. List and a few others, including Nathan and, of course, Chris Foster, who described why we need to annotate and made tools to annotate manually. And what we're trying to address here is: let's do it automatically. So, automatic annotation. I'd like to introduce some concepts. I think the easiest type of annotator is what I would call a set annotator. What I mean by a set annotator is that we take the list of all Chinese characters that exist, and we try to group them into rhyming sets. So if two characters belong to the same set, they rhyme. It's a very simple concept. I have not seen it being named before. But it's easy to understand as an annotator: what you do is you go through a poem and you say, right, I've seen character A, which set does it belong to? And now I see character B. Does it belong to the same set? Then it rhymes, or it doesn't rhyme. So it's easy to understand. It's easy to inspect. And it's easy to visualize.
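The set annotator just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `char_to_set`, the function name `annotate`, the letter labels and the `?` marker for unknown characters are all hypothetical choices, and the placeholder strings stand in for real Chinese characters.

```python
def annotate(rhyme_chars, char_to_set):
    """Label each rhyme-position character by the set it belongs to.

    rhyme_chars: characters found in rhyming position, in poem order.
    char_to_set: hypothetical lookup from character to rhyming-set id.
    Returns one letter per character; same letter means same set.
    """
    letter_for_set = {}  # set id -> letter, assigned in order of appearance
    labels = []
    for char in rhyme_chars:
        set_id = char_to_set.get(char)
        if set_id is None:
            labels.append("?")  # character not covered by the annotator
            continue
        if set_id not in letter_for_set:
            # Sketch only: runs out of capital letters after 26 sets.
            letter_for_set[set_id] = chr(ord("A") + len(letter_for_set))
        labels.append(letter_for_set[set_id])
    return labels


# Toy table: two rhyming sets, written with placeholder strings.
toy_sets = {"c1": 0, "c2": 0, "c3": 1, "c4": 1}
print(annotate(["c1", "c2", "c3", "c4", "c9"], toy_sets))
```

Note that each character is looked up on its own, with no reference to the other characters in the poem.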
The main problem with this kind of set annotator is that it doesn't really take the context into account. As I just explained, you ask, are A and B in the same set? But now if I see a poem which has, let's say, 16 characters in rhyming position, and 15 of them rhyme, we can suspect that the last one would rhyme as well. A set annotator has no knowledge of this. There are natural examples of set annotators, or at least of material that we can use for set annotation. Rhyme books are the example of a set annotator par excellence, because a rhyme book groups together all the characters that rhyme. So if you take a rhyme book and you just take the time to encode it into sets, then you can automatically annotate the entire corpus. In this presentation, I rely a lot on the Guangyun rhyme book as a convenient and often cited rhyme book for annotation. You could also use reconstructions. People have taken the time to make reconstructions, and you could use such a reconstruction to say, well, these things rhyme, so I'm going to build a set with them. And of course, without going into the land of reconstruction, you could just take any manually annotated corpus and say, someone has already annotated this corpus and said that this character rhymes with that character, therefore I'm going to encode this knowledge into my annotator. Right? So, taking inspiration from the paper that uses rhyme networks and rhyme communities to make some analysis of Old Chinese rhymes, I decided to use these to make a set annotator. I'm going to explain how to build a graph out of a list of Chinese poems, and then how we derive a set annotator from it. So here we have three poems. They all date more or less from the eighth and ninth centuries. I don't have the date of any of the poems, I just have the dates of the authors. We can use those as an approximation. We assume that poets write poems when they're alive.
And so here are the three poems, and I'll explain the structure a bit, because we're going to use this kind of table all the time. In the left column, we see the poem. In some of my tables we'll have one line of verse per row, and in some others, for reasons of space, I will have the two lines of a couplet on a single row. The rhyme column shows the last character, so the character that's in the rhyming position of each line. MC stands for Middle Chinese, and that's the reconstructed pronunciation according to the Baxter and Sagart system, which is where I quote it from. So here we can see we have a poem where every even-numbered line has something that seems to rhyme, something like air, air, and then who and who. And sometimes the first line of a quatrain also rhymes. Here I'm not sure if the ta is supposed to rhyme with the others in that quatrain. Maybe it's the intent of the poet, but it doesn't really matter here whether it does or not. And we get the same for the two other poems. Every even-numbered line has something that seems to rhyme, so a, a, a, and the third poem has three rhymes, so a, a, a. From these we can build a graph. If we go through the first poem and we take the character a and then the one pronounced k, we make a node for each of these characters and we draw a line to say they've occurred in a poem in which they're considered to rhyme. So the line between the nodes represents the concept of having rhymed in a poem. We do the same for the next three characters. These three characters rhyme together, so ku, ku and tu. And because they all rhyme with each other, we have the three nodes and we make a line between each pair of them, right? We continue with k and hui. We've already seen k before, it's here. So when we add a node and a line between k and hui, we do not link hui with a, because we have not seen a poem that shows them rhyming together.
We just know that both of them, in separate poems, have rhymed with the same character, but that's it. And of course, the last poem has three characters that rhyme together, so we get this triangle. Right, so this is the basic concept. When things rhyme in a poem, we just link all of them together. When they rhyme, but in different poems, there's no reason to link them together. So now we take this process, but we apply it to roughly 250,000 poems taken from the Quan Tang Shi, so the complete shi poetry of the Tang, and the same for the Song. So you can see these are fairly sizeable corpora, which I argue we don't really want to annotate by hand. And I make two simplifying assumptions. We could perhaps even consider them bad. We consider that everything on an even-numbered line within a poem rhymes. And we also consider that everything in a rhyming position in a given poem rhymes with everything else in a rhyming position in that poem. So it's a simplifying assumption, and if we go back to the previous poems we can see it's even wrong, because the assumption I make in this automated computation is that all these four characters rhyme. Here they don't, and it's the same here. We see that these don't rhyme with ku, ku and tu. But we're going to see that the technique we use afterwards on the total graph makes it an acceptable assumption. So we run this process on the 250,000 poems and we get this huge illegible graph. In fact, there's a bit more in the southwest of this graph, but I thought I would make it slightly more legible by zooming in on this area. And well, what can we say from it? If we look a bit closer, the way this graph is drawn is that if characters often rhyme together in the entire corpus, they are attracted to each other, and if they don't often rhyme with each other, they push away from each other.
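The graph construction under the two simplifying assumptions can be sketched as follows. This is a minimal illustration, not the code used for the research: the function names, the representation of a poem as a list of line strings, and the choice to count co-occurrences as edge weights (suggested by the remark that characters which often rhyme together attract each other) are assumptions, and the Latin letters stand in for Chinese characters.

```python
from collections import Counter
from itertools import combinations


def rhyme_position_chars(poem):
    """Assumption 1: only even-numbered lines carry a rhyme.

    poem: list of line strings; the last character of a line is its
    rhyme-position character.
    """
    return [line[-1] for number, line in enumerate(poem, start=1)
            if number % 2 == 0 and line]


def build_rhyme_graph(poems):
    """Assumption 2: all rhyme-position characters of a poem rhyme together.

    Returns a Counter mapping an (a, b) pair, with a < b, to the number of
    poems in which the two characters were treated as rhyming.
    """
    edges = Counter()
    for poem in poems:
        chars = sorted(set(rhyme_position_chars(poem)))
        for a, b in combinations(chars, 2):
            edges[(a, b)] += 1
    return edges


# Two toy poems: the first links "a" and "k", the second links "k" and "h".
toy_poems = [
    ["..x", "..a", "..y", "..k"],
    ["..x", "..k", "..y", "..h"],
]
graph = build_rhyme_graph(toy_poems)
print(graph)
```

In this toy input, "a" and "h" both rhyme with "k" but never share a poem, so no edge is created between them, which mirrors the hui example above.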
So when we see a lot of characters that seem to detach themselves from the rest but at the same time form a tightly knit community, these often rhyme together in the corpus. And if we just briefly look, and I think a knowledge of Mandarin pronunciation is enough here, we can see that characters which in Mandarin are pronounced with the same or similar finals, or which have similar finals in Middle Chinese, seem to often rhyme together. We see the same thing with characters that rhyme in u, so 估 and some other u as well, et cetera, and we see the same with another group, and with a group that seems to be 澳. And so, the principle of the rhyme community. I will pass over the description of community detection, but the idea is that we look at a graph like this and we say: if characters are grouped together in a way where they have more links between each other than with the outside of the group, then they form a community. So if we look at this group here, its characters very often rhyme together, but less often with the rest of the corpus. That's the definition of a community. And if we go back to the concept of a set annotator, what happens is we can say, well, if things are in a community, then that's a set. The community is the set. So we can start annotating our poems by saying these things did rhyme in the corpus, so we consider them a community, and now we can annotate the corpus. I'm going to use a bit of color now, so that each color represents a community. Unfortunately there's only a very finite number of colors, and so two neighboring groups here are actually, let's say, two different blues, and they represent different communities. And so we see the algorithm identified the fact that they belong to different groups, and so you can use this color to say this is the set annotator.
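Since the talk skips the community detection step, the following is only a stand-in to show the shape of the computation: a small, deterministic label propagation written from scratch. It is not claimed to be the algorithm used in this research, and every name in it (`add_edge`, `detect_communities`, `max_rounds`) is hypothetical. Each node repeatedly adopts the label carrying the most edge weight among its neighbors, so densely linked groups end up sharing a label.

```python
from collections import defaultdict


def add_edge(graph, a, b, weight=1):
    """Add an undirected weighted edge to a dict-of-dicts graph."""
    graph.setdefault(a, {})
    graph.setdefault(b, {})
    graph[a][b] = graph[a].get(b, 0) + weight
    graph[b][a] = graph[b].get(a, 0) + weight


def detect_communities(graph, max_rounds=100):
    """Stand-in community detection by deterministic label propagation."""
    labels = {node: node for node in graph}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            scores = defaultdict(int)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            if not scores:
                continue  # isolated node keeps its own label
            best = max(scores.values())
            candidates = sorted(l for l, s in scores.items() if s == best)
            # Tie-break: keep the current label if it is among the best,
            # otherwise take the alphabetically first candidate.
            new = labels[node] if labels[node] in candidates else candidates[0]
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return list(communities.values())


# Two tightly linked triangles joined by one weak link.
toy = {}
for a, b, w in [("a", "b", 3), ("a", "c", 3), ("b", "c", 3),
                ("d", "e", 3), ("d", "f", 3), ("e", "f", 3),
                ("c", "d", 1)]:
    add_edge(toy, a, b, w)
print(detect_communities(toy))
```

Each returned set can then be used directly as one rhyming set of a set annotator. A real analysis would likely rely on an established community detection method rather than this toy procedure.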
And by the way, if we go back to here, I had identified one group as a community because it seemed to be nicely detached, but we see that the algorithm actually found two of them, although they are very close to each other. They are close communities, and that means that they often rhyme together, but not too often. So the algorithm says these are two different communities, each with its own rhyme. And I should probably note here that when I call them by these labels, I'm relying on the knowledge of existing reconstructions. Of course, if you were to investigate the corpus yourself, at first you would not know these reconstructions, and so you would have to come up with your own labels. We do not know a priori that these communities correspond to a specific label. Right, so I hope this little introduction was clear enough and hopefully I didn't take too much time. We are going to look at a few case studies: once we have built an annotator like this, how does it behave? What can we do with it? Are there limitations?
So the first case I have is a fairly trivial example. We have a poem from the 11th century and, as in a lot of poetry, there's a single rhyme for the entire poem. Here I've added the Middle Chinese reconstruction for all of these characters, and we can see quack, lat, mat, mat, jack, et cetera. So all of these seem to rhyme because they end in -ak. If we look at the rhyme book, the Guangyun, we see that all of these characters in the rhyme column are listed as rhyming in the Guangyun. So that's good. We have a kind of confirmation that it seems to work. Our system has annotated the poem. The type of annotation we see here, where it says A, A, A, A, means that the character after the annotation is part of a rhyme, and this rhyme is A. So the fact that all of them say A means all of them rhyme together.
If we had annotated the earlier poems, we would have had some A, A and then some B, B, meaning these two characters rhyme together and those two characters rhyme together, but the two groups don't rhyme with each other. So that's a fairly trivial example, and of course there's not much merit in it. It may just be by chance that we said, well, everything rhymes, and it just happened to be so. So we can see here a little graph. It's a subset of the big graph, but looking only at characters that ever came into contact with one of these characters. For any poem that has one of these characters, we look at what other characters rhyme with it, and we end up with this graph. We clearly see the community standing apart from the rest, so it seems to form a community of its own. Of course, there were a lot of poems with other characters, and that makes an indistinguishable mess.
So anyway, to convince ourselves that the system is working and that I didn't just write a piece of code that says everything rhymes all the time, which if you remember was one of the simplifying assumptions we make when we train the model, let's look at a poem where the rhyme changes every quatrain. Here, on each row we have a couplet, so two lines actually, to take less space. Two rows correspond to a quatrain, and if we look at the rhyming characters, we see that they seem to rhyme roughly two by two. So eep, eep, on, n, in, in, oom, oom, and then oom. Of course, some of these may feel like they don't particularly rhyme, but I'm going to ignore this. It seems clear that the intent of the poet was A, A, B, B, C, C, D, D, E, E. So I take this just as an example to show that the community annotator is able to detect this kind of pattern.
So even though during training we told the system that everything within a poem rhymes, it is able to learn that in fact not everything in a poem rhymes, and when it's presented with a poem like this, it's able to say that here not everything rhymes together, and we have this pattern of A, A, B, B, C, C, D, D, E, E. And by the way, the other simplifying assumption was that I'm only looking at even-numbered lines, so the remaining characters that we see here I've completely ignored, as a way to make things simpler for myself, and I think it doesn't really matter for the purpose of this talk. Right, so now we've seen that it works. Can we do more interesting things with it? Let's see a more interesting case.
So let's look at when another annotator disagrees with the community annotator. For instance, we know that poets tried, or were encouraged, to use the rhyme books to choose the rhyming characters for their poems, but in practice poets diverge greatly from this ideal of using the rhyme book. So here we have a poem in which the community annotator said everything rhymes together, but if we had looked into a rhyme book, it would say, well, actually most of these things don't rhyme with each other. For this poem in particular, it would say there are eight different rhymes. So we have this discrepancy between the very prescriptive rhyme book, which says these don't rhyme, and our community annotator, which says that, based on the entire corpus, it looks like these things rhyme together. And we can take the characters that we see here and look at the subgraph of the graph we saw earlier. The annotation here means the community annotator considers that the characters in the second column all belong to the same set, or, we can say, to the same community.
Now if we plot this community, and here the colors represent which Guangyun rhyme each character belongs to, we can see that we have something that is very tightly knit. If I removed all these colors and made them all blue, you would not notice that there are different groups. They are too closely stuck together. We can kind of see that there's a split between west and east on this graph, and it corresponds to the tone split. So even though all of these characters were largely used interchangeably as rhymes, we can still kind of see that we have the rising tone on the left and the departing tone on the right. So of course it was common practice to have inter-rhyming between those groups, but it was not perfect. This graph shows the community on the entire Tang and Song corpus. So we are spanning 600 years of poetry, and 80% of that poetry is from the Song. Now if we go back to when we said we're going to build an annotator for the entire Tang and Song, and instead we say let's build an annotator for just the Tang, what we see is a very different picture. With an annotator that is trained only on Tang poems, so earlier poems, the poem is annotated as having five different communities instead of just one. And of course, we said the Guangyun considers that there are eight rhymes. And here we can see that, as opposed to the previous graph, these communities are fairly separate. Of course there is some contact between them, every little gray link shows an inter-community rhyme, but overall they're very distinct. So this could be an avenue for us to track diachronic change in rhymes. If we build an empirical annotator on an early corpus, and then we build another annotator on a later corpus, and they annotate a single poem differently, then perhaps there was a change in rhyming practice, and perhaps that change in rhyming practice is due to a change in phonology.
So we'll come back to this a bit later, but let's go on to the fourth case study, which maybe shows the limits of the community annotator. Here we have a very simple poem with four rhyming characters, and here I didn't go with Baxter and Sagart, but I've used the reconstruction of Late Middle Chinese by Edwin Pulleyblank. Of course, this is a 13th-century poem, so it's actually far beyond the stage of Late Middle Chinese, which in Pulleyblank's view would be, I think, something like the seventh or eighth century. The 13th century maybe corresponds more closely to Early Mandarin, but for the purpose of illustration it is fine to use LMC, because our corpus is centered on the Tang and Song, and here we are really at the end of the Song. So we have four characters that according to LMC seem to rhyme fairly well, so kwa, hi, hi and pi, with a long vowel and a j off-glide, but our annotator considered that the third character, the kyi character, does not rhyme with the others. If we look at the poem, there seems to be very little doubt that the intent of the poet is to make this third character rhyme with the rest. So why does the annotator disagree? The intent is clear, and if we look at rhyme books, we can see that this character, the kyi, is listed as belonging to the so-called kyi rhyme, that is to say the rhyme named after itself, which is just a coincidence here. That's the case both in the Guangyun and in the Ministry of Rites rhyme book, which is a bit later and will become relevant shortly. So that's for the rhyme books: it seems the poet took things from the rhyme book. However, our community annotator says that this kyi character actually rhymes with the may rhyme. So why is that?
If we plot a graph of just these two rhymes, where I've colored the may rhyme in blue and the kyi rhyme in orange, we can see that although they are fairly distinct, there's a lot of contact between the two in the middle. So communities, although they are distinct, can be more or less close to each other. If we look a bit closer at the interface in this graph, what we see first is that the kyi is very much on the boundary, which means it has a lot of contact with its own rhyme group, but it still sits quite nicely in the may rhyme. And if you have sharp enough eyes, you may notice that the characters at the interface between these two groups all have the same phonetic component. We see several of them here, one of which I don't know, et cetera. So it may be interesting to look at these, and of course it's interesting that they all share the same phonetic component. If we look at these characters in a rhyme book like the Ministry of Rites, what we see is that they are all listed as belonging to both rhymes. So it's kind of the aha moment. This explains why we see these two communities being very close: a range of the characters that we saw in the middle are actually listed in the rhyme books themselves as belonging to both groups. And if we look at Pulleyblank's reconstruction of Late Middle Chinese for the characters that most often rhyme with kyi, we see that a lot of them rhyme in ah, and some of them rhyme in yai or just i, but the majority of the characters found in the corpus as rhyming with kyi are actually in ah, which suggests that this character had already lost its j glide at the end. And of course, in modern Mandarin the glide has disappeared. Right, so what can we conclude from this? From the poem, it's clear the intent of the poet is to say these rhyme. The poet is justified in that
all of the rhyme books seem to agree that this character has a j glide and rhymes with the kyi, hai and bai that we've seen. But in practice, if we look at all of the poems of the Tang and the Song, at least in the shi genre, we see that 78% of the time it does not rhyme with characters that have this j glide, but with characters that have either an ah or an eh rhyme. And I found it interesting to note that this 78%, or at least a very large majority, was already the case in the Tang. In fact it was even worse during the Tang: there are only nine occurrences of the character in rhyming position, but in eight out of these nine it was rhyming not with the glide at the end but with a simple ah. And geography does not seem to play a role either. In modern Cantonese and modern Minnan the character still has a j glide, but if we look at all the poems of the Tang and the Song, there doesn't seem to be a correlation between geography and how this character is used as a rhyme. So the annotator is clearly wrong about this poem, because it should have annotated it as A, A. But it seems that the poet's choice is a bit weird, or maybe we could say conservative. It obeys the rhyme book, but nobody spoke like this anymore, and indeed the character had already lost its j glide, so anyone who read the poem in the current pronunciation of the time would have found that it didn't really rhyme. So that's an interesting tidbit, I think: the annotator can be wrong because it doesn't take context into account, but usually when it is wrong, it is wrong for a good reason, and maybe a human would also consider that these two things should not have been picked to rhyme. Right, so it says recap, but it's not the end. A little recap of the advantages of this community annotator. It works on non-annotated data. If you remember, we started by taking the corpus and just assuming that everything rhymed, so we
didn't need to go and annotate anything. We just said we are going to assume everything rhymes at first, and then get the community annotator to disentangle all of this and decide what rhymes and what doesn't. As a result, it requires very little expert knowledge, and I think I'm proof of this: I discovered a lot of things just by using the tool and exploring. It's empirical, which means that even if you didn't have any rhyme book, just by taking the corpus that you have at hand you can train it and then make these kinds of discoveries. I've worked on an accuracy metric for this. I've manually annotated a few hundred poems of various types and I found 98% accuracy in the annotation. There's a bit of a caveat, which is that the shi genre of poetry is very easy to annotate, because it has a tendency to have everything rhyming in a single poem, which is why my simplifying assumption earlier is actually okay for this type of poem. If we were working with something like ci or older poetry, it might not be a very good idea. And it makes it very easy to highlight odd poems. Out of a corpus of a quarter of a million poems, being able to find these interesting examples was very easy, and I didn't have to sift through hundreds or thousands of poems. And of course, as we saw in case four, the problem is that, as a set annotator, it sometimes misses the intent. But at the same time, the fact that it missed the intent in the last case study gave us some interesting discussion. Right, so earlier in case three I mentioned that if we looked at an annotator trained on the Tang and an annotator trained on the Song, and the annotation was different for a given poem, it could mean that there was a phonological change, right? And we seem to have seen a rhyme merge and a rhyme split. The rhyme merge was in case three, where we said something that used to be annotated as five rhymes in the Tang is now annotated as a single rhyme in the Song, and we've seen a
sort of split with case four, where we say something that used to rhyme should have been considered as not rhyming anymore in the 13th century. So can we date these changes, and can we detect them automatically? I propose a strategy for it. Since we know the dates of birth and death of each poet in this corpus, we can assign a creation date to a poem by taking roughly the midlife of the poet. Now that we are able to say when each poem was composed, we can have a sliding window. I'm going to train an annotator every 50 years, by taking the poems that are within 50 years of that point, and only these poems. So I train an annotator based on the poems composed between 600 and 700, an annotator between 650 and 750, et cetera. As we move along, we have annotators that are empirically specialized to a given period, to a given century. We could go more or less granular, the details here are up for debate, but a window that is 100 years wide and moves in 50-year steps seems to be doing okay. And if we go back to the poem of case three, this is what we had, with an annotator for the Tang and an annotator for the Song. We said, okay, in one case we have five communities, one color per community, and in the second case we have one community, where I've used color to show the different rhymes according to the rhyme book, but if we colored by community this would be a single color. So now we annotate this very poem with the sliding-window annotators, and what I'm plotting is how many rhymes there are in the poem according to the annotator for 650, 700, 750, et cetera. What we see is that at the beginning of the Tang this poem would have been considered to have six different rhymes, and so it would not have been a very good poem. The poem was composed here, with the intent from the poet that everything rhymes, but if someone from the Tang
read this poem, they would have said this doesn't rhyme at all, there are six different categories. Here there seems to be a little glitch, which I cannot explain. Once we reach the interregnum between the Tang and the Song, so roughly the 10th century, we see that successive annotators consider less and less that these are different categories, and the number of distinct annotations falls to one, which seems to be correct, that is to say everything in that poem rhymes. And once we reach the end of the Song, there's an interesting phenomenon where the number of rhyming categories seems to climb again, to three. What this means is that here we witness a merge. We used to have six categories and then we have one, so that's a succession of merges. Of course, exactly when each of them happened we don't know, and maybe it had already happened in common speech, but at least in poetry there's a clear mark where we can say that from 900 onwards, and in a window that seems to span 130 years, all of these characters that were pronounced fairly differently now have the same rhyme. And what we see here, when the number increases again, is that things that all rhymed in the 12th century now cease to rhyme, and so we can take a deep dive into why that might be the case. That's a lot of information, but I show the result of an annotator for 860, 1100 and 1300. If we go back to here, these correspond to before the merge, after the merge, and after the final split. What we can see, if we look at the column called community 860 and the one called Guangyun annotator, is that in the Tang there was a very strong agreement between the community annotator trained on that period, the 9th century, and the Guangyun. There are a few discrepancies, but overall we see A, A, C, C, A, A, et cetera. That means that the behavior of poets in the 9th century, for this rhyme at least, was still very close to the Guangyun. The fact that here pan has a B is because it only appears
like three or four times in rhyming position, so there was not enough data to retrain it properly, and we have a misclassification. Once we move to 1100 we see that everything rhymes. And if we look at the Late Middle Chinese reconstruction (by the way, Late Middle Chinese may be a bit early for 1100, but anyway), we see that Pulleyblank reconstructs all of them as having an a rhyme, which was not the case in Middle Chinese, where we had things like aya, etc. So this explains why we got a merge: we went from having six categories here, if we count the different letters, A, B, C, D, E, F, so that's six, and in 1100 we only get one letter, which means everything rhymed in the poem. Finally, if we look at what happened in 1300, and for illustration I show the Early Mandarin reconstruction by Pulleyblank, we see that we have three letters now, A, B and C, and that these correspond, of course, to different vowels, except for pan again, I don't know what's going on with this one. Pulleyblank explains it by saying that for characters with a yi or yu glide, so qian, yan, mian, the a was fronted, so all of these, which I color in blue, have the same fronted vowel. For the ones that have a u glide, so huan and huan, the a moved even further back and became an o, which I've noted in red. And finally, for the ones which had a long vowel, so qian and, here, qian, he argues that the length of the vowel prevented it from being fronted or backed, and so they remain with an a. And that's how we get what we see in this graph here: first we go from six, we fall to one, and then we get back to having three different things. Of course, here I discovered this poem and this interesting phenomenon, and then I tried to figure out why, and thankfully Edwin Pulleyblank had already explained this phenomenon. But it's interesting that I was able to discover this as well, with much less specialist knowledge and less time as
well. So that's how you can discover rhyme change. That's more or less the end of this presentation. It's still a work in progress; I have an article under review for this. The annotation of the entire corpus as described has been published here with this DOI, if you're interested in it. We have an annotation accuracy metric and it seems to behave very well: 98% accuracy on the full Tang and Song poetry, and if we exclude the poems in which the pattern is fairly trivial we get 84% accuracy, which is lower, of course, than 98%, but it's much better than what you would get if you had an annotation done by a rhyme book. There's more work needed. Here, poetry is rather easy because a lot of the time there tends to be a single rhyme per poem, but if you use another kind of corpus, for instance there's a very big collection of poetry from the Song, or if you go to a corpus earlier than the Tang, it might be more difficult, and you might have to bootstrap by annotating by hand a bit and then running the algorithm, etc. But I think the algorithm would provide a good speed-up. An area of particular interest for me is this: I've shown how we can highlight rhyme merges and rhyme splits, but can we automatically discover them, to bring up all of the poems that are relevant for analysis and refine our knowledge of the evolution of Middle Chinese into Early Mandarin? And of course this can be done synchronically: for these poems we know where the authors were from, and so if we approximate their dialect with their origin, we could perhaps see some merges spreading from northwest China to the southeast, or this kind of thing. Right, thanks. I'm happy to take any questions, or even more, suggestions for the work, and if you want to get in touch, my email address is here. Maybe you can put your presentation away, or I don't know. Yeah, I can. I don't know, maybe questions will be about this. Oh yeah, so maybe it's
good to leave up. Let me just say to everyone that you can put things in the chat if you want, but also, if you feel comfortable, you can just unmute yourself and show your face and say something. To get things started, I'll just ask a simple question, which is: you were saying that the ci poems are more complicated, right? But it seems like, precisely because the shi poems are simpler, an annotator trained on shi poems is likely to do a very good job on ci poems. Yeah, that's right. Yeah, okay. It's supposed to work by period, I would say, so yeah, we could use these trained annotators to annotate the ci corpus, that is true. But outside of this case, if you said you want to annotate, I don't know, pre-Qin poetry, then it might not work as well. Okay, so if someone wants to jump in and ask a question, please go ahead. I can't tell who you are. This is where it would be nice if I could see everyone, because then I would say, oh, you look like you have a question, but I can't do that. If I could? Hi, Chris Foster here, can you hear me? Yeah, hi. Sorry, I can't turn on my video for the moment, but maybe if I could just ask a very simple question here, for someone who's not as technologically savvy: the role of those two initial assumptions that you built in to initially train the annotator, you know, rhyming on every even line and that all words rhyme with one another. Can you maybe just explain a little bit more why you picked those, why it's necessary to have those, and why not other options? So why not just say every word rhymes with every word, and start with, you know, everything rhymes with everything everywhere, and then let it sort of work it out on its own? Yeah,
just something that I was curious about. Yeah, that's a very good question, because the entire work is based on this. So the two assumptions work against each other in a way. The assumption that everything in rhyming position rhymes in a poem introduces a lot of noise in your algorithm, right, because we know it is not true. And the community detection algorithm, which goes from this and makes that, is able to sustain a certain amount of noise and say: right, I have seen contact between this character and that character, but not often enough to be sure that they rhyme together. So it's the fact that we have enough material for which the assumption is correct, in other words that the noise is limited, that makes the community detection still work. So that's the second simplifying assumption. If I had taken every single character to be potentially rhyming, it would have generated so much noise, you would have said that every single character rhymes with every other, and the noise would have drowned the signal. Because if we look at, let's take any poem, you see here we have 10 characters in rhyming position, and we are creating a graph with these 10. Now if we took all of these, something like 100 characters, and said they all rhyme together, then all of this signal that is correct here, that is to say that indeed in some poems all the characters rhyme, would have been completely drowned. I don't know, is my explanation at all clear, Chris? Yeah, no, that makes complete sense. So basically what you did was, as a shortcut, you took a non-arbitrary judgment, in the sense that, okay, commonly we see every other line rhyming, so if we start with that very simple assumption, that can help us sort through the noise, in a way give us a foothold into it. Yeah, that's the way to
say that. That's the smallest part of expert knowledge that you can bring. I don't even know if we can call it expert knowledge, but specialist knowledge, let's say. So, that being the smallest sort of assumption of expert knowledge, have you then also brought in annotations that have been done by other scholars of individual poems, taken that and used it to help the annotator refine its learning? Does that make sense? Yeah, it makes sense. So this is a possibility; this is not something I've explored. Here, these two assumptions let me start with non-annotated text. You can also train on annotated text, and this is what Mattis List does in his research on Old Chinese. He says: let's take the annotation from, I believe, Baxter on the Book of Odes, build a graph from these annotations, look at what the communities look like, and then comment on them. I believe he comments on the existence of a new type of rhyme that was not reconstructed before. So yes, you can use an annotated corpus, and you could do anything in between: you could say, I'm going to start from no annotation and annotate the corpus, then review some of the poems, fix the annotations, and then retrain, this kind of thing. Yeah, that's exactly what I was thinking of, List's work, and I'm just curious if you've gone ahead and done that already with your annotator, and if you see any significant changes. I guess it would behave better, and maybe for non-poetry you would have to do that, maybe even bootstrap. I think I kind of allude to it here when I say manually annotated corpora are annotators already. And it's kind of what Mattis List does: he says, right, it's already annotated, but can we apply community detection on top to refine these annotations? If I may just say so, Julian and I have talked about this sort of stuff a lot before, and one thing that I find a
clear way of thinking is that one of these simplifying assumptions introduces false positives, which is that everything in rhyme position rhymes, whereas the assumption that only things on even lines rhyme introduces false negatives. That's what he means in terms of the assumptions pushing in different directions. It's a way of saying: well, we get rid of some of the signal and we get rid of some of the noise, and hopefully we're getting rid of more noise than signal, and that way the community detection algorithm actually finds the communities. And it seems to work, so there you go. Right, yeah. And you could train on even-numbered lines but then annotate all the lines, for instance, that's totally possible. Here I didn't try to annotate every character in the poem, you know, I said I will not try to annotate the first line of a quatrain. It is possible to do, it's just not something that I tried, because my interest was more in the detection of phonological change than in the production of a corpus. But yeah, that's an avenue that can be followed. Yeah, that makes a lot of sense, thank you for clarifying. And also I just think it's interesting: I know the focus here is on phonology, but I think you could almost retool this just slightly and bring in paleography as well. I can imagine that sliding sort of approach that you use in terms of chronology, but also space, maybe being used to say, okay, different orthographies of characters and how they match pronunciations of different periods, and so forth. Yeah, that would be super interesting. And I like the fact that there was this example where we see all the characters that have the same graphical components being just at these borders. I'm not making a claim on this, but yeah, this brings us back to
graphical, well, sound components maybe being a good indication of a family of characters. My thought is of a way of testing the feasibility of orthographic variants in a poem, based on what we understand of the chronology and geography of its composition. So yeah. Okay, well, are there any other people who might have a question? Please don't be shy if you do. But otherwise, we're actually right up against the time anyhow, so maybe it's good if you don't have a question as well. Let's see, yes, so you also have Julian's contact information if you have a question, and if you're interested in this sort of work you're also welcome to get in touch with me. I want to say thank you to Julian for giving this very interesting talk, and I think we all look forward to the published version, as Yen in the chat also says. So let me say, I don't know what to do in this world, a round of applause, or you can make some little emojis or something, for Julian. And then I'll see you at another talk sometime. Thank you all for joining us.
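
The sliding-window procedure described in the talk (date each poem by the poet's midlife, train one annotator per 100-year window moving in 50-year steps, then count how many rhyme categories each annotator finds in a given poem) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function names, the toy corpus with placeholder characters "A" to "D", and the `min_weight` threshold are all hypothetical, and the community detection step is replaced by a much simpler stand-in (connected components over characters that co-occur in rhyming position often enough).

```python
from collections import defaultdict
from itertools import combinations


def midlife(birth, death):
    """Approximate a poem's composition date by the poet's midlife year."""
    return (birth + death) // 2


def windows(start, end, width=100, step=50):
    """Yield (lo, hi) year windows, `width` years wide, every `step` years."""
    lo = start
    while lo + width <= end:
        yield lo, lo + width
        lo += step


def train_annotator(poems, min_weight=2):
    """Map each rhyme character to a community label.

    Stand-in for real community detection (an assumption of this sketch):
    characters are linked when they co-occur in rhyming position in at
    least `min_weight` poems, and the connected components of that graph
    are taken as the communities.
    """
    weight = defaultdict(int)
    for chars in poems:
        for a, b in combinations(sorted(set(chars)), 2):
            weight[(a, b)] += 1

    parent = {c: c for chars in poems for c in chars}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), w in weight.items():
        if w >= min_weight:  # rarer contacts are treated as noise
            parent[find(a)] = find(b)
    return {c: find(c) for c in list(parent)}


def count_rhymes(annotator, chars):
    """Number of distinct communities among a poem's rhyme characters.

    Characters unseen in training each count as their own category.
    """
    return len({annotator.get(c, c) for c in chars})


def rhyme_counts_over_time(corpus, target, start, end):
    """For each non-empty window, count the rhyme categories in `target`."""
    dated = [(midlife(b, d), chars) for b, d, chars in corpus]
    counts = {}
    for lo, hi in windows(start, end):
        poems = [chars for year, chars in dated if lo <= year < hi]
        if poems:
            counts[lo] = count_rhymes(train_annotator(poems), target)
    return counts


# Toy corpus: (birth year, death year, characters in rhyming position).
corpus = [
    (600, 660, ["A", "B"]), (610, 670, ["A", "B"]),
    (620, 680, ["C", "D"]), (630, 690, ["C", "D"]),
    (900, 960, ["A", "B", "C", "D"]), (910, 970, ["A", "B", "C", "D"]),
]
counts = rhyme_counts_over_time(corpus, ["A", "B", "C", "D"], 600, 1000)
print(counts)
```

On this toy data the target poem falls into two categories for the window starting in 600 and into a single category for the windows starting in 850 and 900, which mimics a merge. The window starting in 650 reports three categories only because two of the characters never appear in its training poems, a small-scale version of the sparse-data misclassification mentioned in the talk.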