I'm Julian. In the Chinese historical corpus, we have a lot of rhymed material that spans a very long time: the Shijing, the Chuci, and other early texts, like we see on the bronzes, the mirrors, et cetera. And as Ash has mentioned, we have these hundreds of thousands of poems and millions of lines. And these can be used as a source of historical pronunciation information. Or you can study it for other things; if you want to study the stylistics of poetry, that's also possible. But here we're interested in the fact that, if we annotated it, we could perhaps analyze phonological change. The corpus is too big to annotate by hand. And so in a previous article I presented a method that tries to, in a sense, read: you read a lot of poetry, and from reading a lot of poetry without knowing the pronunciation, you infer what rhymes with what. At first you imagine everything might be rhyming, and then, because there are some characters you never see together, you conclude they probably don't rhyme with each other. And with this, you can annotate like 250,000 poems. So I published the annotated Quan Tang Shi and Quan Song Shi; I did the shi because that's easier. And then there's the question of the quality of these automatic annotations, because you can automatically annotate things and produce rubbish. And so in the article there are some examples. Sorry, I will have to move. You have examples of poems that are not just AAAA, where the annotator correctly detected that it rhymes like AA, BB, CC, et cetera. I don't know if you can read it, but this one rhymes somehow: yin, yin, and then zhong, yong, xiu, hu. OK. And in the same article I also present cases where, well, so here my annotator, you see, puts a B on the jia. And maybe in Mandarin we can kind of see it: we have huai; this one, I'm not sure how it's pronounced, actually.
But I'm going to guess it's pronounced something in -ai. And then jia, and pi, sorry, hai. OK. So we have -ai, -ai, an odd one out, and -ai. So in Mandarin, it doesn't rhyme, and it seems my annotator also considered that it didn't rhyme. But in Middle Chinese, these all rhymed: the jia had a different reading, if you follow Pulleyblank. So, all right, the article gave a few examples of, hey, sometimes it works, sometimes it doesn't, and overall it seems to be working. But can we measure the quality of this? And so most of my presentation today is about how we measure whether it's good or not, because there are 250,000 poems, so we need a metric. My citation style is not very good here, but essentially there's some literature on metrics for rhyming. There's Haider and Kuhn, who have a metric that checks the accuracy of saying whether two words rhyme, but not in the context of annotating poems. So they just ask, in the abstract: could these two words be used for rhyming? And I am more interested in asking: do these two words rhyme in a specific poem? Because two words might rhyme in a poem and then at a later time not rhyme. So this is the kind of context I'm interested in. And the other paper, on which I'm going to base a lot of the further discussion, is the one published by Mattis List, Nathan Hill, and Chris Foster, who are here, where there's this entire idea of standardizing how you annotate poems. And towards the end of that paper, there's a proposal to use what are called B-cubed metrics; I'll introduce what that means just after. And they use it to say that Baxter and Wang, for the annotation of the Shijing, are very close: I think that's like 97% agreement between them. Right. So the idea is that I'm going to present what B-cubed metrics are, what their properties and unfortunate disadvantages are, and why I chose to use something else. I realize that these nodes might be difficult to read, so I'm sorry.
But now you know what graphs and nodes are, hopefully, from the previous presentations. So here we have a poem. That's the poem I've just shown, where my annotator gives a B on the jia. And the ground truth is me going back by hand and saying: no, actually, I think this poem is A, A, A, A. And here we see a representation of the ground truth on the left: four characters that are all linked to each other. And on the right, this is what my annotator produces: it says these three characters rhyme together, they're linked, and the jia is alone on the side, if that makes sense. Is it OK so far? Yeah. All right. So for B-cubed metrics, the idea is (maybe I can use the pointer) that the thing on the left is the truth, the thing on the right is what we evaluate, and we compute two things that are called recall and precision. For instance, we take this character here and look at its cluster; when you have several nodes linked together, that's called a cluster. You take this cluster there, and then you compare with the cluster that the same character is in on the left. You take the intersection, which would be this triangle there. And for recall, we ask: did we get all of the nodes? And here, we didn't get all the nodes: on the left we have a cluster of four characters, while on the right we have a cluster of three. So for this character, we only identified 75% of the things that rhyme with it. And for the character that's alone, we say: OK, it's in a cluster of one here, while over there it's in a cluster of four, so we only got one quarter of the nodes that are supposed to be in the cluster. Now we average over all the nodes: we have three nodes where we identified three quarters, and one node where we got one quarter. And that gives us this number: the recall, which tells us, of all the things that rhyme, how many we have found, is 0.625. And then we have the kind of reverse metric, which is precision.
And the intuition behind precision is to say: when I said that two things rhyme, how often was I correct? And here, when my algorithm said two things rhyme, it was 100% correct for this poem. Because here it said the jia rhymes with the jia, which is trivially correct, and these three rhyme together, which is also correct. So precision is one. And then we derive a metric that's called F1. In mathematical terms, that's a harmonic mean: you do two times precision multiplied by recall, divided by precision plus recall, and you get this score of 0.769. So this is how B-cubed metrics work; that's the basic principle on a simple four-line poem. And now the question is: OK, let's take this; this is the third poem of the Shijing, and specifically its second stanza, of the Juan'er. And we see the annotation by Baxter in his 1992 book, and by Wangli. Well, I cite 2014, because that's the edition of his collected works that I found, but I imagine it's his book from the 80s. And we can see that they basically agree: this is a poem that rhymes A, A, A, A. I forgot what the characters are, but I'm going to just read them in the Wangli reconstruction, where it's something like muai, duai, luai, huai. But on top of this, Baxter also says: well, there are these characters there that are not in final position that also rhyme. And I don't have a great intuition for this, but if you look at their Mandarin pronunciations, perhaps, you would say, well, that's right. And then this one just happens to rhyme at all stages of Chinese. So Wangli and Baxter gave different annotations of this poem. And this is the Baxter graph: we have six characters that all rhyme together, so we have links between everything. And we have Wangli, who has four characters rhyming.
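The B-cubed computation just walked through is compact enough to sketch in code. This is a minimal sketch of the standard B-cubed definitions applied to the four-character example (it is not the speaker's actual implementation, and the character names w, x, y, z are placeholders):

```python
def bcubed(truth, pred):
    """B-cubed precision, recall, F1; truth/pred map each node to a cluster label."""
    def cluster_of(assignment):
        # For every node, the set of nodes sharing its label.
        groups = {}
        for node, label in assignment.items():
            groups.setdefault(label, set()).add(node)
        return {node: groups[label] for node, label in assignment.items()}

    t, p = cluster_of(truth), cluster_of(pred)
    nodes = list(truth)
    recall = sum(len(t[n] & p[n]) / len(t[n]) for n in nodes) / len(nodes)
    precision = sum(len(t[n] & p[n]) / len(p[n]) for n in nodes) / len(nodes)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Ground truth: all four characters rhyme; the annotator left one out.
truth = {"w": "A", "x": "A", "y": "A", "z": "A"}
pred = {"w": "B", "x": "A", "y": "A", "z": "A"}
print(bcubed(truth, pred))  # precision 1.0, recall 0.625, F1 ~ 0.769
```

Recall averages, per node, how much of its true cluster was recovered (three nodes at three quarters, one at one quarter, giving 0.625), and the harmonic mean of 1.0 and 0.625 gives the 0.769 from the slide.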
So now, based on this, we can't really compute the B-cubed metric I was telling you about before. Because if I say, OK, let's take this character (what is this one, huai) and compute the recall or the precision for it: well, it's not present in that graph, so we can't compute. A simple solution is to say: right, for every character that Wangli did not annotate but Baxter did, add a little annotation saying that Wangli considered it didn't rhyme. And so for these two characters, we add an annotation like 'this is B and this is C', just to say that they are not part of the original annotation. And what we get is this graph here. So now it has the same nodes as the one from Baxter, but as we see, it has fewer edges. Feel free to interrupt if something is not clear. So now that we have these two graphs, we can compute B-cubed metrics. For recall (I need to mention the direction: I consider Baxter correct here, and Wangli to be the evaluated one), Wangli only identified 0.5 of the things that rhyme. And every time Wangli identified two things as rhyming, Baxter agreed, so that's a precision of 1.0. And when we do the harmonic mean, we get 0.667. All right, it seems that all is good. But what if I add a third annotator? So now let's pretend there's this guy, he speaks Mandarin, and he says: well, there's this wei character at the top there, and clearly here we have all these rhymes, so probably wei is also part of the rhyme. And let's pretend that person is me. So I add an A there in front of wei and say: yes, it's part of the rhyme. And this is the graph for that third annotator, where everything rhymes. Now we need to realign Baxter, so we add a B in front of wei, and we need to realign Wangli, so we add a D in front of wei. That's the same principle as in the previous poem. Nothing has changed so far.
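The realignment step just described (padding an annotation with singleton labels for every character it did not mark) can be sketched as a small helper; the character names and the letter-label scheme here are hypothetical:

```python
def align(annotation, characters):
    """Give every character a label; unmarked characters get fresh singleton labels."""
    out = dict(annotation)
    used = set(out.values())
    fresh = iter("BCDEFGHIJKLMNOPQRSTUVWXYZ")
    for ch in characters:
        if ch not in out:
            # Pick the next unused letter as a brand-new singleton cluster.
            label = next(l for l in fresh if l not in used)
            out[ch] = label
            used.add(label)
    return out

# Wangli marks only four of the six characters Baxter marks (placeholder names):
wangli = {"c1": "A", "c2": "A", "c3": "A", "c4": "A"}
print(align(wangli, ["c1", "c2", "c3", "c4", "c5", "c6"]))
# c5 and c6 receive distinct new singleton labels, "B" and "C"
```

After alignment, both annotations cover the same node set, which is what makes the node-based B-cubed computation possible in the first place.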
Now if we compute the scores: OK, so we get a score for the third annotator against Baxter, and against Wangli. Those values don't matter. But we can recompute the score between Wangli and Baxter, and we get 0.73. But two slides ago, I told you that the score between Wangli and Baxter was 0.667. So just by introducing a third annotator and saying 'oh, I need to realign Baxter and Wangli', the score has changed, although, of course, the two publications of Wangli and Baxter have not changed themselves; it's purely my computation. And this, I think, is the problem with B-cubed metrics. It means that when, for instance, the paper by List, Hill, and Foster says Baxter and Wang agree 97% of the time, if I bring in a third annotator, I'm going to push that score between Wang and Baxter up. Any time you add more annotations, the scores go up. So it means you can't compare results across time unless you recompute everything, and every published result becomes obsolete as new results are published. You can only compare results by taking everything that has been published in the past and recomputing on the new set of annotations. That's possible, but it presumes you have all the underlying annotations, and you can't quote old numbers. That's a bit problematic. Right. One thing we could do is say: OK, we're going to always annotate every single character of every poem. That is stable: you can report the number twice, and it doesn't matter if I add other annotators; we will always get the same scores. And it is possible to do this. But every time you add these sorts of dummy annotations, the score goes up, because now you say: oh yes, Wang and Baxter agreed, for this first character, that it doesn't rhyme, and that pushes the score up. And if I do this, Baxter versus Wang on this poem is not 0.667 or 0.73, but 0.91.
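This instability is easy to reproduce numerically. Below is a hedged sketch (placeholder character names, not the real Shijing graphs) showing the Wangli-versus-Baxter B-cubed F1 move from 0.667 to about 0.727 purely because a third annotator forces both graphs to be realigned with an extra non-rhyming node:

```python
def bcubed_f1(truth, pred):
    """B-cubed F1; truth/pred map each node to a cluster label."""
    def cluster_of(assignment):
        groups = {}
        for node, label in assignment.items():
            groups.setdefault(label, set()).add(node)
        return {node: groups[label] for node, label in assignment.items()}

    t, p = cluster_of(truth), cluster_of(pred)
    nodes = list(truth)
    r = sum(len(t[n] & p[n]) / len(t[n]) for n in nodes) / len(nodes)
    pr = sum(len(t[n] & p[n]) / len(p[n]) for n in nodes) / len(nodes)
    return 2 * pr * r / (pr + r)

baxter = {c: "A" for c in ["c1", "c2", "c3", "c4", "c5", "c6"]}  # all six rhyme
wangli = {"c1": "A", "c2": "A", "c3": "A", "c4": "A",
          "c5": "B", "c6": "C"}  # only four rhyme; two aligned-in singletons
print(round(bcubed_f1(baxter, wangli), 3))  # 0.667

# A third annotator proposes a seventh rhyme position; realign both graphs:
baxter["c7"] = "B"  # Baxter: it does not rhyme
wangli["c7"] = "D"  # Wangli: it does not rhyme
print(round(bcubed_f1(baxter, wangli), 3))  # 0.727
```

Neither original annotation changed; only the shared node set did, and the score still moved, which is exactly the complaint here.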
And even if I compare Baxter against an annotator that says nothing rhymes, you get a score of 0.85, while the maximum is 1. So as you can see, we get results that are a bit difficult to interpret. We don't know: is 0.85 good? Well, I've just shown that it isn't, because someone who says nothing rhymes in this poem gets 0.85 compared to what we consider to be the truth. And this is unavoidable, I think, with B-cubed metrics. The reason is that rhyme judgment is edge-based on a graph; it is not node-based. And B-cubed metrics measure things based on nodes rather than on relationships between nodes. So any metric that is node-based will have this problem, and you can eliminate an entire class of metrics if you consider, like me, that this is an issue. I've added some graphs here: the two graphs above and the two graphs below express the same judgment of what rhymes and what doesn't, but they just look different. And the fact that these graphs look different but express the same thing really tells us we should be looking at the edges. Those extra nodes that we see here play no role. So the way I solve this is just to say: I'm not going to look at the nodes; I'm going to look at the edges of the graph, which don't change when you add another annotator. Adding an annotator doesn't change the edges of previous publications; you only add new nodes to the graph, and those nodes are alone, so it is fine. And the metric that I propose is extremely simple: let's use precision, recall, and F1, like we did before, but instead of doing it the B-cubed way on nodes, we just do it on edges. So we ask: Wang said this character and that character rhyme; do we find that edge in Baxter, yes or no? And vice versa. And that means it's not dependent on the alignment. So it doesn't matter whether you have Wang, Baxter, and a third annotator or not; you can compare more than two annotators, and you can compare results that have already been published.
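A minimal sketch of this edge-based scoring might look as follows; this is my reading of the proposal, not a published implementation, and the character names are placeholders:

```python
from itertools import combinations

def edges(clusters):
    """Turn a list of rhyme clusters into the set of undirected rhyme edges."""
    out = set()
    for cluster in clusters:
        out |= {frozenset(pair) for pair in combinations(sorted(cluster), 2)}
    return out

def edge_prf(truth_clusters, pred_clusters):
    t, p = edges(truth_clusters), edges(pred_clusters)
    precision = len(t & p) / len(p) if p else 1.0
    recall = len(t & p) / len(t) if t else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Baxter links all six characters; Wangli links only four (placeholder names):
baxter = [{"c1", "c2", "c3", "c4", "c5", "c6"}]
wangli = [{"c1", "c2", "c3", "c4"}]
print(edge_prf(baxter, wangli))  # precision 1.0, recall 0.4 (6 of 15 edges), F1 ~ 0.571
```

Isolated nodes contribute no edges, so a third annotator's extra characters cannot change these numbers. Note also that a fully linked cluster of n characters contributes n(n-1)/2 edges, so big clusters weigh heavily in the score.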
So if you have a ground truth, let's say Baxter is always the ground truth here, you can say that Wang gave this number and the third annotator got that score, and then you can say which of the two annotators is better, instead of having these abstract numbers that are not interpretable. And that means published results remain valid across time, which I guess is a desirable property. The downside is that, as you see in this sort of graph, every time you add a new node, you add a lot of edges, not just one. In fact, the number of edges is a quadratic function of the number of nodes in the cluster, and that means your score is going to be highly influenced by big clusters. So if you have a very long poem that you've annotated very well, that's going to do a lot more good than badly annotating ten other poems that are small. Or if you have a poem where there are ten lines with the same rhyme and then the two last lines with a different rhyme, getting those two wrong is not going to matter much. But I think it's acceptable. First, I don't think we can really avoid it if we agree that we should annotate edges and not nodes. But also, it is more difficult to annotate long poems, so maybe we should reward this. People can propose other metrics that would be better, but I don't think these cons can really be avoided. Right. So now that I've talked about the metric, I'm going to use it. I mentioned that I published this annotation of the Quan Tang Shi and the Quan Song Shi, and I wanted to ask: now that I have a metric, is what I published good? And as Ash presented earlier, I have three different annotators: one that says everything always rhymes; one that checks in the Guangyun whether it rhymes or not, and annotates that way; and the last one, my annotator, which has in a sense read a lot of poetry and is able to make better decisions. These are the three annotators.
And I classify the poems on: do these three annotators agree? Do they all disagree? Or is there one of the three that disagrees? And we see that in the corpus, basically, community is rarely the odd one out: 0.4, well, there are only 300 poems out of 250,000 where community doesn't agree with either Guangyun or naive while naive agrees with Guangyun. And, yes, where am I going? Since I can't really check the ground truth on 250,000 poems, the idea is to sample some of these poems. So I took a sample of a bit over 400 poems and went and annotated them by hand, taking some poems from all of the categories that we see here. And then I compute the score on that sample rather than on the entire corpus. Then you can try to compute whether it's statistically significant. I had done the computation; it's not presented here, but basically the error bars were very small. They were so small that I was basically writing something like plus or minus 0.00, so I just didn't write it. It took me around two and a half hours to annotate these 400 poems. So, yes, even annotating 5,000-plus poems would at least be more manageable than doing 250,000. And very interestingly, the way I do it here is that I take the sample, annotate it with my algorithm, and then correct the annotation, which saves time. Every time there's a poem that's just A, A, A, A, I just scroll past, and those took me two seconds instead of maybe fifteen. And most of my time, half of it, was spent on the 20% hardest poems, which is nice, because it means I spent my time correctly: things that were easy to annotate, I just look at visually, OK, it looks fine, I move on to the next one, and my time is spent on the stuff that is hard, where the computer is less able to do the job. And these categories kind of tell you where you might want to spend your time, should you wish to annotate the entire corpus. Right, so, results.
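The evaluation procedure just described (draw poems from every annotator-agreement category, hand-correct them, then score only the sample) could be sketched like this; the category names and counts below are made up for illustration:

```python
import random

def stratified_sample(poems_by_category, per_category, seed=0):
    """Draw up to per_category poem ids from each annotator-agreement category."""
    rng = random.Random(seed)
    sample = []
    for category in sorted(poems_by_category):
        poems = poems_by_category[category]
        sample.extend(rng.sample(poems, min(per_category, len(poems))))
    return sample

# Hypothetical categories keyed by which annotator is the odd one out:
corpus = {
    "all_agree": [f"poem{i}" for i in range(1000)],
    "naive_disagrees": [f"poem{i}" for i in range(1000, 1400)],
    "guangyun_disagrees": [f"poem{i}" for i in range(1400, 1700)],
    "community_disagrees": [f"poem{i}" for i in range(1700, 1750)],
}
sample = stratified_sample(corpus, per_category=100)
print(len(sample))  # 350: one hundred from each big category, all 50 of the smallest
```

Sampling per category rather than uniformly is what guarantees that even the tiny "community disagrees" bucket, a fraction of a percent of the corpus, is represented in the hand-checked sample.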
I present precision, recall, and F-score for the three annotators, on the entire sample of 400 poems. And what we see is that the clear winner is always community: community has very good recall, precision is very good, and so is the F-score, for which the maximum value would be 1.0. So these are good results. Unsurprisingly, the Guangyun doesn't do too badly on this kind of corpus, because it's a corpus that was written with the Guangyun more or less in the background. And we see that when the Guangyun says two things rhyme, it's almost always correct; in fact, it's as correct as the community annotator. It's just that the Guangyun misses a lot of the rhymes: things that poets used as rhyming, the Guangyun will say, no, no, you can't use these two things together in a poem, they don't rhyme. The fact that we get very good precision but not that good recall means there were a lot of merges and not very many splits; if you had splits, then your precision would also go down. So we have these good numbers, and, right, we can still break it down by inter-annotator agreement, as I was showing. And these are the three annotators. And we see that basically almost everywhere, the community annotator is the best annotator. When all of the annotators agree, fine. When naive disagrees, which means community and Guangyun agree, they have the same score. When the Guangyun disagrees, it doesn't seem to matter: the community annotator is just almost perfect. And it's only when the community annotator disagrees with the other two that it's less good. And I think Nathan was asking earlier: if the three annotators disagree, then what? Well, it seems like community is still the best choice you have; 0.77 is not a great F1 score, but, you know. So the conclusion here is: all right, what I've published earlier is usable. And based on these scores, you could even say: well, OK, I know some areas where it looks like it's bad.
So I could just go and hand-annotate the ones where it's bad, if I wanted a better result. You know, when they all agree, or when the Guangyun disagrees, we've seen that the F1 score is almost perfect, so don't waste your time annotating those 25% and 63% of the corpus. You can, in priority, annotate those 6.5% and then maybe those 5.1%, and of course you can annotate these 300 poems here. So we've already reduced the space of what you would need to annotate by hand, should we want to. That still takes maybe, well, 80 hours, I think I computed, to annotate the rest. I'm not going to do it, but that's more manageable than, you know, 600 hours. Right, so that's a quick recap, and then afterwards a little game. So: we have a metric. There might be better alternatives; I propose that we use the one I presented here, but I'm very open to someone saying there's something better. And we have two data sets. The one I published before is now kind of validated; it's good enough, and you can use it for whatever you want to be doing: historical phonology, stylistics, analysis of rhyme patterns, et cetera. And I've also published the hand-annotated sample, so anyone can come and write their own annotator and compare against it. And, right, apart from this, I have a little quiz. Wow, that is dense. Can you date this poem? We have three columns that represent Early Middle Chinese, Late Middle Chinese, and Early Mandarin. Well, I know that Early Mandarin is a bit problematic, given that maybe the poet is not from a Mandarin area. But the annotations you see, those D, A, B, et cetera, are what my annotator produced. This is not what I believe: I believe this poem rhymes all the way through, it's A, A, A, A, A, A down the two columns, but my annotator was not able to pick that up.
So of these three columns, which one do you think best represents the period of this poem? If you can read it; yeah, that should be readable. All right, who thinks it's Early Middle Chinese? OK. Who thinks it's Late Middle Chinese? A few more hands. And who thinks it's Early Mandarin? I have the answer. All right, the correct answer is Late Middle Chinese. The poet is called Lu Chen, and he's from the late 10th century, from Hunan. What you can see is that in this column, everything has the vowel a, sometimes ia instead of a. But if you look at the first column, during the Early Middle Chinese period, this poem would have read with various finals that don't really rhyme consistently. And in Early Mandarin, we have some a, a, ye, yue, et cetera. So the only column where it consistently rhymes and has the same vowel throughout is Late Middle Chinese, which corresponds with the dates of this poet. So I quite like this. And the fact that the annotator is not able to pick this up is because it looks at the entire corpus, from the beginning of the Tang to the end of the Song, and makes a general model of it, instead of being able to say: oh, at the time of that poet, this is how people would have rhymed. And OK, same exercise. Now you can try to guess the period, and maybe make a comment on where this poet might have been coming from. So here again, this is my annotator, which is wrong on this poem; I think this poem rhymes throughout, so everything should be A. I'll let you read for a few more seconds. And then: all right, who thinks this is an Early Middle Chinese era composition? I see no hands. Late Middle Chinese? Early Mandarin? All right, so 100% of the hands that were raised were on Early Mandarin. That is perfect precision. Yes, this is a poet from the 11th and 12th century.
I don't know at what point the poem was composed, but can you guess roughly where the poet would have been coming from? North? South? Yes: indeed, it is a Northern poet, because we see a very early loss of the codas, the -k and -t at the end. So in Early Middle Chinese, this didn't rhyme at all: we had some -ik, -it, -ik, et cetera. When the -k dropped, if it was a -k after a front vowel, I think, it gave rise to a yod, and then that yod assimilated with the vowel and basically left an e. As for the -t, it just dropped, and it happens that everything here is -it, so once you drop the -t, you get e. There's this interesting character which has a fairly rare reading with a -t: there's some note in, like, the Jingdian Shiwen that says, hey, there's a pronunciation with a -t, and this matches semantically exactly this line. So we can confirm that it was meant to be pronounced with a -t, if the poem had been written at that earlier time. So maybe the poet kind of knew: oh, there's a -t here, and I'm trying to make things rhyme with a -t. He wasn't just making everything rhyme in e, or, maybe, I don't know, in an Early Mandarin pronunciation; he had a sort of awareness: oh, that's actually a rusheng syllable. I don't know where he would have gotten this awareness. Yes. Well, I'm not entirely sure; I didn't dig, but I think so. Yeah, I think that's related to that development. Anyone can call me out on this; I have no idea. But anyway, I think this is an interesting poem: you can look at it and say, well, based on these codas, clearly it could not have been an Early Middle Chinese composition, because it's -k, -k, -k, -k, and we see that in Early Mandarin suddenly everything resolves very nicely. Aside from the tone; I believe someone asked a question about tone rhyming.
There were rules, of course, in Tang and Song poetry, but there are also a lot of poets who will ignore this or discard the question. So let me know your questions or comments. Yeah, you could do this with all languages, but the reason we do it for Chinese is that we don't have an alphabetic spelling to tell us how the characters are pronounced. So there is less of a need, I think, for other languages, but it's still a relevant thing. One of the papers I quoted earlier, where I said they only look at whether two words rhyme but not in the context of poems: they could do it in the context of poems. They look at rhymes in early modern German; I think they go up to 20th-century authors. An absence of an edge to a node where we expect there to be a rhyming relationship, versus to a node where it's just random and there isn't an expectation of rhyming, means something different. That's an absence that is meaningful, in a way that for other nodes an absence of an edge wouldn't be meaningful at all. How do we, can we, take that into account? Is that taken into account by only focusing on edges, or do you think it's missed? Where do you base your expectation on? You say there's no edge where we expect there to be one; maybe that's... Yeah, exactly. Well, we're just thinking about, and I don't want to say naive, but the naive annotator, right? There's a lot there to say: OK, we expect line-end A, line-end A. The fact that maybe there is a line with A and then a line without A is more meaningful than every line having no A at all. Yeah. I am not sure how to answer this question; I don't know if I even framed it right. It just seems like, maybe it's the precision: a precision of agreement in terms of 'there is a rhyming relationship' is fine, but agreement on 'there should be a rhyme here, but there isn't' is something different. Yeah.
I think that's something you could do. So, my method is really trying to address the challenge of annotating these very large amounts of data that we have. And then you can say: now I have some specific questions to ask, and I can do some corpus queries on it. And there you could look for, yes, the absence of something that you expect, but then you already start with the hypothesis 'I expect to find this'. May I intervene here? Because I think that in a way you already deal with this. So, we're talking about the B-cubed metric, which Julian ends up not using, right? But already there, the intuition is: if two people disagree about a character in rhyme position, it means something different than if one of them thinks there's a line-internal rhyme and the other doesn't, right? But this is actually captured exactly by the adding of nodes, because you would only add nodes when someone has proposed that those are relevant positions. So it's only under the maximal hypothesis, where you say all characters are relevant, that this criticism, or this point, really applies, because otherwise, well, Baxter did think those positions were relevant. Sorry, it's not really part of the question anymore, but I noticed that both Baxter and Wangli annotate the stanzas of the Shijing; they don't annotate the whole poem. And if you look at the very first poem, you have one rhyme, and then you go two stanzas below, and it's the same rhyme again, and these are annotated as different graphs. And I feel that surely the fact that in this poem we have two stanzas that rhyme the same way, even in Mandarin, should count; why don't they take it as positive evidence for rhyming? They just assume, OK, rhymes can't cross stanzas. Yeah, sorry, that's not a question; that's just a point I wanted to bring up at some point. I have just a trivial question: why is B-cubed called that?
B-cubed is called this because the authors of the B-cubed paper are called Bagga and Baldwin, Amit Bagga and Breck Baldwin, so that's three B's. Oh, yeah. OK. Yes.