So this is a preliminary study on how to use network models to analyze Old Chinese rhyme data. The idea of using network models is, I think, an obvious one, but I never had time to actually look into it. Recently, for this workshop, I had some pressure, so I just coded the data, made some first applications, and wrote some things up. What I hope to show is that computers can be useful for our research in historical linguistics in general, but also for these specific cases: even if we do not believe in what computational methods give us, even if we do not believe in any of them, they can still be useful to assist in finding or checking hypotheses, and that is what I want to focus on.

Okay, network models: this is all about rhymes and networks, so let's start with rhymes, and I take the opportunity to quote my favorite rhyme, from Eminem's "Lose Yourself": "You better lose yourself in the music, the moment you own it, you better never let it go. You only get one shot, do not miss your chance to blow, this opportunity comes once in a lifetime." Why do I take this? Because it is interesting with regard to rhyme analysis. If I were given an exam and a teacher asked me to identify the rhymes in this poem, or whatever we call it, I would give the following analysis: "music" rhymes with "own it", "go" with "blow", "shot" with "not". Teachers would of course maybe criticize some of these, like rhyming "it" with "it", but then, Germans would rhyme words like "employ" and "deny". Which is better? I don't know, but this is the analysis I would give.

Now, what about networks? A network is a simple data structure: it has nodes, which represent objects, and edges, which represent relations between objects.
We can tag or label nodes, as shown here by using different colors: we can say this is the red node, this is the blue node, we can give them names. We can do similar things with edges: we can label them and weight them, so that one edge is thicker than another, meaning a stronger relation between the objects. We can also direct them, but directed networks won't be used in this application, because rhyme networks, as you will see later, are not directed; this is just to illustrate the possibilities. Networks are really simple, but the good thing about them is that there are many techniques, both visual and analytical, that we can apply to get more out of the data once we have it in network form.

Now, how do we get from rhymes to networks? If we take a stanza from the Shijing and mark the rhyme patterns, as here in red, saying this character rhymes with that one, then we simply create a node for each character, and connections are drawn whenever the words rhyme: if two words rhyme in a poem, we make a connection in the network. If we add more stanzas of the Shijing and again identify what rhymes and what does not, then we can make further connections, because a character that occurs as a rhyme word in two different stanzas, here and here, gets connected to the rhyme words in both. This is the simple idea of how to turn the Shijing rhymes into a network.

So how did I construct such a network? First, I had to prepare the data. The starting point was the rhyme annotations given in Baxter (1992). The data was not digitally available, so I transferred Baxter's annotations to a digital version of the Shijing.
I took the Project Gutenberg version, which has all kinds of problems, but I couldn't crawl the data from ctext.org because they would block me, so I took this as a starting point. Then I went through Baxter's book and annotated whenever something rhymed. The digital version was corrected during the process: I would find cases where Project Gutenberg does not give the right character, though not always. I was a bit under time pressure, so I worked half an hour every day in order not to get too bored of this, and after one month I had corrected certain cases. Furthermore, I also had a digital collection of most of the Old Chinese reconstructions given in the new OCBS system.

Now, I organized the data in a structured way, as you can see here: first we have the poem, which is something we know; then the stanza, which we also know; and then I annotated something which I call a section. Sections are the stretches of text that are followed by a comma or a full stop in the Old Chinese text, for the simple reason that their ends are the potential rhyme locations: if we strike off things like the affixes that are sometimes added to the lines, then the ends of sections are where we would find the rhyme words. We could also use this to try to detect automatically what rhymed, but I did not do that, because I took the annotations. So if a section contained a rhyme word according to Baxter's annotation, this was noted as such, and if I detected further rhyme words, or had reasons to disagree with Baxter's annotation, this was noted in an alternative annotation. For each section, I tried to identify the Old Chinese readings in the OCBS system, but this was not possible in all cases: some 400 readings are missing, and I still did not have time to check them.
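The organization just described, poem, stanza, section, plus the rhyme annotations, can be pictured as one record per section. The field names and values below are made up for illustration; they are not the actual format of the data set:

```python
# Hypothetical record for one annotated section (illustrative only).
section = {
    "poem": 1,           # number of the poem in the Shijing
    "stanza": 1,         # stanza within the poem
    "section": 2,        # section within the stanza (ends at comma/stop)
    "rhyme_word": None,  # rhyme word per Baxter's annotation, if any
    "alternative": None, # alternative annotation where one disagrees
    "ocbs": None,        # OCBS reading of the rhyme word, if identified
}

# A section with no annotated rhyme word simply leaves the slot empty.
print(section["rhyme_word"] is None)
```

The point of keeping Baxter's annotation and the alternative annotation in separate fields is that disagreements stay transparent and reversible.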
The first step I then made was a little app, because nowadays everybody makes apps, and I also wanted to make one. I didn't give it a real name, but you can access it via this link, and I will show how it looks. As you can see here, we have all the sections: for each word we have the occurrences, sorted here; you can go by the hanzi, then we have the pinyin, then a gloss, then Middle Chinese where I had it (some things are missing), then Old Chinese in Baxter's system, then the Cantonese reading, and what I got from the metadata.

Now, the nice thing about all this is the following. I can search for a character: for example, I switch to Chinese input and look up all the instances of a given character in the data, and I get them here. We can then look at the contexts where it occurs by clicking here, and the rhyme words are annotated: you see that it occurs in this position, what it rhymes with, here are the reconstructions, and here is the next stanza. I use this color scheme to make it easier to spot the rhymes. I think this is already really useful, at least for beginners. For me it was useful for checking what was going on, because when looking at the whole text, it is usually difficult for people who do not really know the classical texts to identify the rhymes. But this was just the first step, to illustrate the data.

Now, getting back to the network I was talking about: how did I construct it? I took all characters which occur in the Shijing in a position annotated as rhyming, according to Baxter's annotation; these are the nodes. Links between two characters are drawn whenever they are annotated as rhyming in a given poem.
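The node and link definition just described can be sketched in a few lines of Python. The stanza data below is made up for illustration; in the real analysis the groups come from Baxter's annotations:

```python
from itertools import combinations
from collections import Counter

def build_rhyme_network(stanza_rhyme_groups):
    """Build an undirected rhyme network.

    stanza_rhyme_groups: one entry per stanza, each entry being the
    set of characters annotated as rhyming together in that stanza.
    Returns a Counter mapping unordered character pairs (the edges)
    to the number of stanzas in which the pair rhymes.
    """
    edges = Counter()
    for group in stanza_rhyme_groups:
        # every pair of rhyme words in a stanza gets a link
        for a, b in combinations(sorted(set(group)), 2):
            edges[(a, b)] += 1
    return edges

# toy example with made-up rhyme groups
groups = [["A", "B", "C"], ["B", "C"], ["C", "D"]]
net = build_rhyme_network(groups)
print(net[("B", "C")])  # B and C rhyme together in two stanzas -> 2
```

Characters are the nodes (the keys appearing in the pairs), and the counts already anticipate the edge weights discussed next.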
The number of instances in which two characters rhyme in a given stanza was counted and assigned as the edge weight in the network, and node weights were derived from the number of times the rhyme words occur in the Shijing in a potential rhyming position. The data was then further normalized, which we need to do when working with this kind of data, by counting every pair of identical sections only once, in order to avoid giving repeated phrases too much weight. We know that there are many phrases in the Shijing which are repeated across poems, and we should not count them two or three times, because then we would think that these words are really strongly connected, when maybe it is just that later poets imitated the ones who first composed them. This is easy to do: we simply count each identical section once.

Nevertheless, thinking about it later, I realized I should normalize further: if we have a poem or a stanza in which three words A, B, C all rhyme with each other, then I make three links, A-B, A-C, and B-C. This may be a problem, because we could assume that people who write a long stanza and want all the words to rhyme get sloppy: they get tired and think, okay, this doesn't really rhyme, but I still add it here because it makes sense in this case. We can see this in hip-hop as well. To account for that, one should perhaps divide the weight by the size of the rhyme group, but I did not do that, because it was only recently that I detected this problem. I think the network is still useful: we have a certain bias, but we can work with it.

Now, regarding the analysis of the Shijing network, let's first look at the graph I drew. This is quite interesting in my opinion, because we have almost a small-world graph, one where you can get from almost any node to almost any other node. We do have disconnected components.
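The two normalizations just described, counting repeated phrases only once and down-weighting links from large rhyme groups, can be sketched like this. The 1/(k-1) factor is my reading of "divide by the size of the group"; the exact scheme is an assumption, not the one used in the talk:

```python
from itertools import combinations
from collections import Counter

def build_weighted_network(stanza_rhyme_groups):
    """Weighted rhyme network with two normalizations:
    1. identical rhyme groups (repeated phrases) are counted once;
    2. each link from a group of k rhyme words is down-weighted by
       1/(k-1), so long, possibly sloppy rhyme chains do not
       dominate (the exact factor is an assumption).
    """
    edges = Counter()
    seen = set()
    for group in stanza_rhyme_groups:
        key = tuple(sorted(set(group)))
        if key in seen or len(key) < 2:
            continue  # skip exact repeats and one-word groups
        seen.add(key)
        k = len(key)
        for a, b in combinations(key, 2):
            edges[(a, b)] += 1.0 / (k - 1)
    return edges

# repeated group counted once; pair from a 2-group keeps full weight
groups = [["A", "B", "C"], ["A", "B", "C"], ["A", "B"]]
net = build_weighted_network(groups)
print(round(net[("A", "B")], 2))  # 0.5 from the triple + 1.0 -> 1.5
```

With this weighting, each rhyme group contributes the same total evidence per word regardless of its size.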
Many characters are completely isolated, but we have this large cluster of things that are all interrelated. Now, what do we do with such a network? We could say, okay, let's zoom in, zoom in, zoom in, and see whether we spot something, but that is not really interesting; I realized it is rather difficult to find any structure that way. So we need something more. One possibility, since we have labels and can visualize them with colors, is to take the six vowels of Old Chinese and color the network accordingly. And what you can see is that some structure emerges: the network has an internal structure, it is not complete nonsense that we see here. Once we have something like that, an almost small-world network with a large connected component, we can also try to identify clusters. And these are actually the things we are interested in, because these are the groups which might correspond to a reconstructed rhyme group; clusters, or communities, which I will talk about in a minute, are what we can then try to identify here.

I also looked at transitions between the rhyme groups, because I was interested in that: how many cases do we have of one rhyme group rhyming with another? But as you see here, this does not give us anything in this form. Maybe there was some bias in the coding, or the approach simply does not work this way: just agglomerating all nodes of the same rhyme group into one node and then seeing how much the groups interact gets really messy.
What we can also do, and this is maybe more interesting, as you will see when we look at a smaller subset of the data, is to compute how often a certain rhyme group rhymes within itself and how often it rhymes with other groups. "Rhyme group" here means the Old Chinese rhymes of the Baxter-Sagart reconstruction. I did this without tones, though tones, I mean glottal stop and the *-s suffix, should be included, which is not ideal, but it is a first step. You can then see how often the groups rhyme: whenever a cell has a really deep red color, it means the group mostly rhymes with itself, so it is a well-established group. We also see some spots where a group rhymes almost more often with others than with itself. In those cases we might start looking for patterns that should perhaps be revised, or for an explanation in the approach that I used.

But even better is searching for communities. Communities are, in my opinion, the most interesting thing when doing network analysis. To explain quickly what a community is (you may know already): the notion comes from social network theory. People are interconnected: we have some people here, and they know each other, X knows Y and Y knows Z. If we look at the connections, we can identify certain connections that are more important, that lie more inside a group than between groups. We can show this, for example, like this: I would say that this is a group and this is a group, and these connecting links are less important; maybe these people just know each other coincidentally. By this, we can split a network into two communities, in this case, and label them accordingly. We can give them labels like Chelsea and Liverpool (that was last Saturday), or we can assign numbers to them, like 1 and 2. Using this idea, I then applied a community detection algorithm to the network.
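The inside-versus-outside comparison just described boils down to a small computation over the weighted edges. The characters and group labels below are made up; the function names are mine, not from the talk's code:

```python
from collections import Counter

def rhyme_group_cohesion(edges, group_of):
    """For each rhyme group, compute which fraction of its total
    edge weight stays inside the group (vs. crossing to others).

    edges: dict mapping (char_a, char_b) pairs to rhyme weights.
    group_of: dict mapping each character to its rhyme group.
    Returns {group: fraction of group-internal weight}, where a
    value near 1.0 means a well-established, self-rhyming group.
    """
    inside, total = Counter(), Counter()
    for (a, b), w in edges.items():
        ga, gb = group_of[a], group_of[b]
        total[ga] += w
        total[gb] += w
        if ga == gb:
            inside[ga] += 2 * w  # counted once per endpoint
    return {g: inside[g] / total[g] for g in total}

# toy data: two groups with one cross-group rhyme
edges = {("A", "B"): 2, ("A", "C"): 1, ("C", "D"): 3}
groups = {"A": "*-an", "B": "*-an", "C": "*-ar", "D": "*-ar"}
print(rhyme_group_cohesion(edges, groups))
```

A group whose cohesion drops well below the others is exactly the kind of spot where one would go back to the individual stanzas.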
I used InfoMap, which is, in my opinion, a really nice algorithm, published by Rosvall and Bergstrom in 2008. It is a fast community detection algorithm with very good performance; in my experience, also with other data, it gives really nice results. It handles weighted nodes and weighted edges, which we have in this network, and it uses random walks on the network in order to determine the best partitioning into communities. The results can again be inspected in another app, which I will show now; let me make it a bit bigger.

So, we run this analysis on the network and split it into communities, into groups. The question is: what do we get? If you look at this, this is what we get. You can see we have large clusters, but in fact almost 400 different clusters. This view is only showing the connections from one cluster to another, and they are all still interconnected, so finding structure here is still difficult. Maybe it looks impressive, having a nice visualization of a network like this, but by itself it does not tell us much. So I also looked into what the communities actually show. They do not necessarily overlap directly with the rhymes of the OCBS reconstruction. That means the same rhyme in the OCBS system can be split into two communities, which also makes sense, because we know that not all words will be used in the same context in order to rhyme. This is why interpreting the data is difficult: a community identified by InfoMap is not necessarily homogeneous, since rhyming itself is not homogeneous.
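To make the idea of community detection concrete without pulling in an external package, here is a minimal sketch using simple label propagation. This is emphatically not InfoMap, which is based on random walks and the map equation; it is just a stand-in showing how a partition emerges from weighted edges alone:

```python
def label_propagation(edges, nodes, rounds=10):
    """Tiny community-detection sketch: each node repeatedly adopts
    the label carrying the most edge weight among its neighbors.
    NOT the InfoMap algorithm, just an illustration of the idea.
    """
    nbrs = {n: {} for n in nodes}
    for (a, b), w in edges.items():
        nbrs[a][b] = nbrs[a].get(b, 0) + w
        nbrs[b][a] = nbrs[b].get(a, 0) + w
    label = {n: n for n in nodes}  # start: every node its own community
    for _ in range(rounds):
        changed = False
        for n in sorted(nodes):
            if not nbrs[n]:
                continue  # isolated node keeps its own label
            # total up edge weight per neighboring label,
            # breaking ties deterministically by label name
            score = {}
            for m, w in nbrs[n].items():
                score[label[m]] = score.get(label[m], 0) + w
            best = max(sorted(score), key=lambda l: score[l])
            if best != label[n]:
                label[n], changed = best, True
        if not changed:
            break
    return label

# two tight rhyme clusters joined by one weak cross link
edges = {("A", "B"): 3, ("B", "C"): 3, ("A", "C"): 3,
         ("D", "E"): 3, ("E", "F"): 3, ("D", "F"): 3,
         ("C", "D"): 1}
labels = label_propagation(edges, "ABCDEF")
print(labels["A"] == labels["C"], labels["A"] != labels["D"])
```

On the toy graph, the weak C-D bridge is correctly ignored and the two triangles end up in separate communities, which is the behavior we want from any such algorithm on the rhyme network.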
We have cases, especially when a character occurs only once in the Shijing, where it will be assigned to some place where we would not want to see it. A split of words with the same rhyme into two communities does not imply that they do not rhyme, and we always need to go back to the real data and see what is going on there. This is why I think the app is useful, because here we can search for certain parts, though it is really preliminary. I can, for example, say: give me rhymes, not in Chinese, rhyme equals "ar", and then I press OK, and it finds all the instances where we have any of these in the data. We can see here, for example, a group, community 16, and now we can click and see all the characters, where they occur, the stanzas in which they appear, and we can cross-check this with the other app: if I click on this and open the link in a new tab, I see in which poem it occurs. I think this is also useful just for checking specific instances.

Okay, now getting back to the analysis. What I then did was break it down: I looked only at certain cases, and what I wanted to look at was the *-r coda, because this is something that is new compared to 1992, and something which may be worth examining from this perspective, because the rhymes may give us some evidence here. Here is again the same view as before, inside the group versus outside the group, how often they rhyme, and we find certain well-established cases: rhymes like *-aj and *-an seem rather clear here, but we also have cases where it is less clear. *-ar seems to rhyme more often with *-an than with itself. These are the cases we need to look at in detail. So I am not claiming that we just use these figures and say, look at this; rather, when something stands out, we go back and look in detail at the cases, at what is going on there.
But now an interesting case, just to illustrate how this can be useful, and to advertise it a little bit. Looking, for example, at the split between *-ar and *-an, and at the network here, you can actually see that it seems to have some structure: we have a cluster here, a cluster here, and a cluster here. Here I use the annotation of *-an and *-ar in Old Chinese Baxter-Sagart, as Laurent provided it to me. In this first view I ignored the fact that in the book they actually have the nice practice of marking uncertainties; I simply took all reconstructions as certain. But if we add the uncertainties, the picture changes somewhat, and we see that the structure, with one set of rhymes being over here, the green rhymes potentially over here, and the blue rhymes, which should be a transition group, occurring here, makes much more sense. So with the uncertainties added, the uncertain cases are actually ones where we can now use this analysis to provide additional evidence to resolve them. Of course, a potential split like this does not mean that, for example, this green character over here really belongs there: it occurs only once and is isolated, so we need to be careful when looking at the data.

But now, the same view again, this time with the community detection algorithm and how it partitions this, and it splits it into separate cases, as you see here. Again I first hide the unclear cases, and if you look at this cluster, for example, and I now switch to the view with uncertainties, you can see that all the green cases which were annotated as potentially uncertain turn yellow, and they are surrounded only by blue rhymes.
If you look even closer, we have these characters here, and provided that the reconstruction is really certain about whether, say, shan should really have an *-r coda, and given the marked uncertainties, we could draw the conclusion that maybe the whole cluster should be reconstructed with *-r. Here is a point where a detailed analysis, and I am only showing the preliminary stuff, could potentially yield an improvement of the system by resolving the uncertainties.

This is all I wanted to show; only a short outlook remains. Where are we with this? Rhyme analysis based on network approaches is still strictly experimental. We need to enhance the data: missing readings, dropped lines in the Shijing text, all kinds of problems, since the version I was using was not particularly good. And we need to enhance the models, with better normalization, as I mentioned before. But already at this stage it turns out to be useful to inspect the automatically identified clusters in times of doubt regarding the reading of a certain character. I think it is generally useful to make use of interactive visualization techniques when dealing with large amounts of data, and tools like the Shijing Rhyme Browser are, I think, especially useful for beginners, but probably also for experts, is my impression.

And where could we be? Imagine a world in which we have large collections of rhyme networks for all kinds of poetry, ranging from Shakespeare via Bob Dylan up to Eminem. We could then gather important information on rhyming behavior, both cross-cultural and culture-specific. We could track the emergence of hip-hop or the degradation of rhyming patterns in modern poetry, or we could even try to test the influence of the "Judas" call on Bob Dylan's rhyming practice.
Imagine, seriously, that we could carry out large-scale comparisons of rhyming practice in different stages of Chinese, that we could propose transparently our individual assessments of what we think rhymes in pieces of Old Chinese poetry, and that we could trace the history of Chinese poetry in networks. Doesn't that sound like it could be interesting? Okay, thanks to Laurent Sagart and William Baxter for discussion, tips, ideas, and data. Thanks to Bob Dylan, Eminem, Shakespeare, and all the other poets out there. And thanks to you for your attention.