Okay, so I'm Nathan and I work here, so you have probably seen me, though I've been away for a year. Sandy can't be here, so I want to thank Sandy for making this extraordinary talk possible. With that I'll introduce Mattis — Dr. Johann-Mattis List — who comes to us from the CNRS in Paris. He studied linguistics and then did a PhD on computational sequence comparison in historical linguistics. In Paris he works both at the East Asian linguistics research centre of the CNRS and also with biologists. And he recently — congratulations to him — received an ERC Starting Grant, in which computational methods will be used for linguistic reconstruction, and the intention is that this will be based at the Max Planck Institute for the Science of Human History in Jena, which you may know about, because it recently started and has attracted a certain amount of attention. We are also working together on another ERC research project, based here at SOAS as well as the British Library and the British Museum, on the reconstruction of the history of Burmese, which he may touch on — no, not so much. So, I'll just read the title: Beyond cognacy: historical relations between words and their implications for phylogenetic reconstruction.

Yes, thanks for the invitation. I'm glad to be here. We have been talking for a week about this line of work. The connection goes back to Chinese, because I will talk not only about "beyond cognacy" in general, but about Chinese as my main example. It goes back to the project I was pursuing at the time in Paris, which is called "Vertical and lateral aspects of Chinese dialect history", where I have been trying to investigate the development of the Chinese dialects, concentrating specifically on lexical evolution — on the lexical change of words across the dialects.
So, what I'm presenting here is actually based on a paper that was recently published in the Journal of Language Evolution, the newly established journal at Oxford University Press, but I have included some other material, because only presenting the paper itself, I thought, might be too boring, so I have tried to put some more things into it. I'm going to start by saying a few general things about language change. We know that languages change, as I've already said, as long as they exist: old words get lost, new words get created, and even the pronunciation of words changes slowly over time. When the speakers of two language varieties separate, the changes may become so great that they can no longer communicate, and what was once one language has now become two. That is the simple background. About 200 years ago, scholars were starting to become aware of the fact that languages change and can be related. These were scholars like Sir William Jones in Britain, who compared languages like Icelandic, Latin, Greek, and Sanskrit — languages spoken not in the same regions, but in really various geographical regions — and they realized that these go back to one common source. This is how it all started with comparative linguistics. They established a new method for language comparison, based on the intensive comparison of languages, in which they tried to identify regular, recurrent similarities in order to prove language relationship and to reconstruct the development of language families. This is the comparative method, and due to the comparative method we have now gained considerable insight into language history. But first, let us go back again to the roots of how we model language history. And the first figure, in the first dimension, would be August Schleicher.
Schleicher was the one who first said that if we look at the language data — at the comparative data on languages, the results of the comparative method that we apply to compare languages — we need to illustrate them by the image of a branching tree, the "Bild eines sich verästelnden Baumes". And then he drew his first language tree. This is the oldest one we can find, from 1853, by August Schleicher. It is a little bit small here, but you can see the labels: Deutsch, Letto-Slawisch, Slawodeutsch, Ariograecoitalokeltisch, Indogermanisch. But what is striking, what I think is really funny about this tree, is that it is a German oak. It looks like a German oak: we don't see anything of the kind we see nowadays, where people draw refined cladistic constructions with these little lines, the mathematical trees — Schleicher is really thinking of a tree. But this phase of dendrophilia, as I call it, did not last long in linguistics, because linguists have to be skeptical all the time. Some twenty years later, Johannes Schmidt was writing his book from 1872, where he said: you can turn it however you like — as long as you stick to the idea that language history has developed by repeated splittings, that is, as long as you assume that there is a Stammbaum, a family tree of the Indo-European languages, you will never be able to explain, in a scientifically satisfying way, all the facts that have been assembled. He was really concerned with being scientific, and then he came up with his alternative theory — and this is what people usually mean when they quote Johannes Schmidt as the father of the wave theory. He said: I want to replace the tree by the image of a wave that spreads out from a center in concentric circles, becoming weaker and weaker the farther they get away from the center.
The problem is, Johannes Schmidt never showed how the wave theory would look. Where Schleicher gave us the German oak, Schmidt didn't give us much. In his 1872 book there is no visualization. In 1875 he tried to make a related point — he tried to confirm that there is no tree — and there he gives us this visualization. But this is actually nothing more than a tabulation: you see that he lists the Slavic varieties, the Polish speakers here, the Bulgarian speakers here, the Russian there, and so on — the Slavic languages represented in the wave theory. The problem that I personally have with the wave theory is that I don't know what this is supposed to tell us, because I don't see any history here. And if we compare what people afterwards claimed to be good visualizations of the wave theory — be it nice little pie-chart-like diagrams in the style of Schmidt, or Bloomfield, who was, I think, one of the first to use isogloss boundaries, at least on larger language families, or those who used Venn-style diagrams, or those who used a network — I don't see any dynamics in there, I don't see any history in there. And I think it's worth going back to another scholar; but first, let me make a summary of the criticism of trees. Trees are difficult to reconstruct — I agree with that: we don't see the tree for the forest of possible trees that we have there. Languages do not always split — I also agree with that; it is really a problem with trees. And trees are also boring, because they model only the vertical aspects of language history: we know about the splits, but we don't know what was going on in between.
But the same might be said against waves: nobody knows how to reconstruct them; languages still diverge, even if not necessarily in split processes; and waves are boring too, since they model only a single aspect of language history, if any. And here it is useful to go back to Schuchardt, who already in 1870 — though it was only published later, in 1900 — made this statement in an introductory lecture that he gave when he entered the university; I think it was in Leipzig, but I'm not sure at the moment. He said: we connect the branches and twigs of the tree with countless horizontal lines, and it ceases to be a tree. So he was arguing against the tree, but here he was showing us what he meant: we have a tree and we connect the branches, and then it ceases to be a tree — but it's not a wave, it's a net. I think networks are a better way to model these processes. That is actually what I am trying to do in the project in Paris at the moment: to take all the different aspects into account, to model language history more realistically. But now we have a problem — and this is why, going forward, I won't show you any networks of the Chinese dialects: I couldn't do them yet, because there was another problem. If we look at lexical change, consider, for example, the Indo-European word for sun. People reconstruct the word for sun by assuming that there was an irregular paradigm in Proto-Indo-European, in which the oblique cases were built on one stem, represented by the green line here — so the genitive of 'sun' had an n-stem plus the genitive ending — while the nominative and accusative were built on an l-stem. So we had an alternation between n and l in this very word already in the ancestral language. How does this pattern then surface in the descendant languages? In Germanic it must have been retained, because in Germanic we have the same pattern, but differently resolved in German and Swedish.
So in German we have Sonne, from the n-stem, and in Swedish we have sol, from the l-stem, so this leaves us no other possibility than to assume that the pattern was still present in Proto-Germanic. If we go to Romance, we know that the alternation was already lost, so we have sōl, or a form with an ending, in Proto-Romance. In French, a derived word developed — a diminutive, something like 'small sun' — and this word then led to the French word soleil, so we have some morphological change going on here; and in Spanish we have sol, the word more or less as it was in Latin. This shows that we have patterns which are quite complicated: we have morphological change, and we have semantic shift going on as well — if a word means something like 'small sun' and then comes to mean simply 'sun' again, that is a semantic shift. We have a complex interaction, even though we have a tree here — because this is a tree, there is no contact going on — and the variation that actually leads to problems of interpretation is the variation I want to talk about today; this is the question of "beyond cognacy". Now, let us try to be a bit more precise about lexical change. People usually start from Saussure: we can always start from Saussure and say we have arbor and the image of a tree — the form and the meaning as the two sides of the linguistic sign. However, it is not simply a coin with two sides, as Saussure had it; it is more complicated. We should not forget that we are always talking about a particular language, and so it is better to think of the sign as a triple, because the language matters: if I only have a meaning attached to a form, it doesn't say anything unless I know which language I am speaking. And from this it follows that there are three ways by which lexical change can happen.
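This triple view, and the three dimensions of change it implies, can be sketched in a few lines of code. This is a minimal illustration in Python; the function names are my own choosing, not anything from the talk:

```python
from dataclasses import dataclass, replace

# The linguistic sign as a triple: form and meaning, as in Saussure,
# plus the language the word belongs to.
@dataclass(frozen=True)
class Word:
    form: str
    meaning: str
    language: str

# Morphological change: the form is altered within the same language.
def morphological_change(w: Word, new_form: str) -> Word:
    return replace(w, form=new_form)

# Semantic change: the meaning shifts; form and language stay put.
def semantic_change(w: Word, new_meaning: str) -> Word:
    return replace(w, meaning=new_meaning)

# Change along the language axis: borrowing / language contact.
# (Ordinary inheritance from Latin to French is NOT a change here.)
def borrowing(w: Word, new_language: str) -> Word:
    return replace(w, language=new_language)

# The 'sun' example from above: Latin sol acquires a diminutive
# (morphological change) meaning 'small sun' (semantic change), and
# the diminutive later comes to mean simply 'sun' again.
sol = Word("sol", "sun", "Latin")
soliculus = semantic_change(morphological_change(sol, "soliculus"), "small sun")
soleil = semantic_change(morphological_change(soliculus, "soleil"), "sun")
```

Each operation changes exactly one coordinate of the triple; the point of the example is that a real development like sol > soleil stacks several such elementary changes.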
Lexical change can happen along the form — the form can change, through morphological change. It can change along the language — and this is something I am very aware of: a word switches from one language to another, so this would be borrowing, or language contact. And it can change along the meaning. These are quite different things, as in the example I gave for 'sun'. Gévaudan, back in 2007, used this schema in order to show the different dimensions of lexical change: stratic change, semantic change, morphological change. Just a few examples. We have an Old High German word, kopf, which meant 'cup', and today it means 'head' — Kopf in Standard German. If we want to make a verb out of it, we can add a little morphological change and we have köpfen, and we can use our head to play football. And here we have a borrowing case which actually also involves morphological change: from the English 'World Cup' we borrow into German, but we retain the German word for 'world', so we say Weltcup — it's misspelled here on the slide. So this is just illustrating these different dimensions. Now, the other point — what I am working on at the moment — is quantitative approaches, and this is actually where biology comes in. As I am working in two departments in Paris, with biologists on the one side and with linguists on the other, it is important to talk a little bit about that before I come to the main point.
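The coding step that such quantitative approaches rely on can be sketched as follows. This is a toy example with invented cognate-set IDs, not the actual datasets: each concept's cognate judgments are spread into binary presence/absence columns, one column per cognate set, which is the matrix format that phylogenetic software consumes:

```python
# Toy cognate-coded word list: concept -> {language: cognate-set id}.
# IDs are arbitrary; words sharing an id are judged cognate.
data = {
    "hand":  {"German": 1, "English": 1, "Italian": 2, "French": 2},
    "blood": {"German": 3, "English": 3, "Italian": 4, "French": 4},
    "tooth": {"German": 5, "English": 5, "Italian": 5, "French": 5},
}

def binary_matrix(data):
    """Turn cognate judgments into a binary presence/absence matrix."""
    languages = sorted({lang for slots in data.values() for lang in slots})
    # One column per (concept, cognate-set id) pair.
    columns = sorted({(c, i) for c, slots in data.items()
                      for i in slots.values()})
    matrix = {
        lang: [1 if data[c].get(lang) == i else 0 for (c, i) in columns]
        for lang in languages
    }
    return columns, matrix

columns, matrix = binary_matrix(data)
# German and English share the 'hand' and 'blood' columns, so any
# distance- or likelihood-based method will group them together.
```

Note the built-in assumption of exactly one word and one cognate set per language and concept; assumptions like this are exactly what partial cognacy breaks.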
If you type "historical linguistics" into Google, it doesn't yet suggest "evolutionary biology" next to it — maybe in ten years, or maybe never — but I just wanted something nice to show. You know that recently biological approaches have become more and more common, and people say: let's take powerful biological methods, analyze language data, and tell linguists things — and it's all presented as completely novel. But it is not necessarily completely novel, because interactions between biology and linguistics go far back in the history of the two disciplines. We can even start by looking not at biology but at geology, at Lyell's work on the Antiquity of Man, where he wrote: if we knew nothing of the existence of Latin, if all historical documents had been lost, if tradition even were silent as to the former existence of a Roman empire — and I skip some things here — a comparison of the languages would enable us to say that at some time there must have been a language from which these six modern dialects derived their origin in common. What we are talking about here is uniformitarianism, and this is the uniting factor between the disciplines: it unites biology and linguistics, but also geology. Uniformitarianism, in the version of Charles Lyell, says: there is uniformity of change — the laws of change are uniform; they applied in the past as they apply now and will apply in the future, no matter at which place. Graduality of change — change proceeds gradually, not abruptly. And the third point is abductive reasoning: we can infer past events and processes by investigating patterns in the present, which "become the key to the interpretation of some mystery in the archives of remote ages". I think this is the main point that we have in common with biologists: we only have the data that we observe right now, and we try to infer what has been going on in earlier times.

We can also find this in August Schleicher. If we look at his writings from 1848 up to 1863, we find him repeating the same things: he talks about language change as a gradual process; at least certain aspects of language change as a law-like process — we know about sound laws in linguistics; as a natural process which occurs in all languages, so universality is addressed here; as a universal process which occurs at all times; and he also says that it allows us to infer past processes and extinct languages by investigating the languages of the present. So we find all of this reflected in linguistics as well, and I think this is the reason why we find these commonalities between the two disciplines, and why we also use trees to represent certain kinds of divergence, certain kinds of history. Many people claim nowadays that it was biology that influenced linguistics even back then — that August Schleicher saw Darwin's book and afterwards said, now I need to make trees in linguistics. But Schleicher's book was from 1853, and by then he knew nothing of Darwin's work. He was introduced to Darwin's work, according to the open letter that he wrote to Ernst Haeckel, only in the early 1860s, by receiving a German translation of Darwin's Origin of Species; he read it and then wrote a reply to it. There he even said: our trees are actually better than yours, because our trees are concrete — I can write Indo-European words on them — whereas Darwin only had an abstract schema, because he did not dare to say anything more concrete. So it was not a direct exchange of ideas that led to the development of similar approaches in biology and linguistics, but the astonishing fact that scholars in both fields came, at about the same time, to detect striking parallels between both disciplines, both regarding their theoretical foundations and the processes they were investigating. And linguists were the first to draw trees — it is always important to tell that to the biologists. We have a paper that just came out in Biology Direct — a collaborative work with the biological department where I work — and there you can see the early diagrams that were published: trees and networks. This one is really a network; this one is a tree-like schema; and this one actually shows the commonalities between Germanic dialects. If we go back, we find well before 1700 the first images of language divergence using tree-like, branching, or network-like patterns, and the first ones in biology come only up to a hundred years later. So if we count before 1700 and after 1700, it's something like 3 to 0 for linguistics against biology.

At the moment, 150 years later, we are in a situation where some linguists have the impression — and I think it will get better during the next years — that we have been living in a nice ivory tower: we have been doing our work on Indo-European and other language families, but all of a sudden there is this storm of bits and bytes created by the biologists, and many people are just afraid that this storm will break apart our nice little ivory tower. So we actually have two groups there: the other people say, no, it will help us — the ivory tower will shine with a new gleam and give us a new bright future. I think this is true, if we do it right, but we should not overestimate the possibilities of what biology can offer us, and we should also not underestimate the things that biologists can learn from us. So, with some reconciliation going on here, we should not be too humble about the things we have learned during 200 years of research in historical linguistics. The quantitative turn — you could call it a quantitative turn — began something like fifteen or twenty years ago, when more and more papers turned up with titles like "Indo-European and computational cladistics", language trees, "classification by numbers", classifications of the world's languages, Indo-European trees and networks — that last one is a paper I was involved in. But the quantitative turn proper was at the beginning of the 2000s, when people started using linguistic data and biological methods. How does this look in general? This brings us right to the point of what I want to talk about. Usually people start from concept lists in the tradition of Swadesh: they take a lexical word list, and the word list has some headwords, basic concepts — so we wouldn't compare 'fridge' or 'refrigerator' or something like that; we only compare words where we think they were also present 2,000 years ago, 4,000 years ago, 10,000 years ago. So we take words like 'hand', 'blood', 'head', 'tooth', 'to sleep', 'to say', and then we translate them into a couple of languages — here into German, English, Italian, and French — and then we code them for cognacy. We code them by saying that the German and English words are cognate: they go back to the same root in Proto-Germanic; we know the history pretty well, so we just code it in this way, by giving both the same ID. It's a 1 here; for Blut and blood it's the same, 3 and 3; and Italian sangue and French sang also get the same ID. So we code them in this way, and the colors just illustrate the sets — here we have 'tooth', reflected in all four languages. Now, how do we use this in order to reconstruct something, or to do a quantitative analysis? We tabulate it, still showing the basic concept, and implicitly we have the knowledge about the proto-forms here — but only implicitly, because it is not used in the analysis. You can see that the Proto-Germanic form which was blue up here is still blue down here, and you see 'hand', and now the question is how to code the fact that
this form is present in English and in German, in these two languages, but absent in Italian and French. By doing this trick we can tabulate the data and get a file that we can feed directly into biological software, which then tries to infer a tree, on the reasoning that this pattern gives us some indication that English and German are closely related; 'blood' behaves the same here, and likewise for the other patterns. We can then plot this onto a tree: we take the matrix and infer whether a new word evolved or was lost. In this case people would say — we have the pattern A, B, C here, so we assume these are three different words, and A is present in English and German nowadays — that the word A evolved on the way from Indo-European, or whatever we assume was here, to the Germanic languages: they gained a new word, so we call it a word gain or word origin. And we also have a loss process here: in French the word in set C was, for example, lost, and so French is left with zero words in this set. This is only a toy example, of course — we have more data than that — but this is how it is basically done; this is the basic idea behind all these approaches.

What is important here is to look at analogies and parallels: are we discussing analogies that people just made up because things look similar, or are we discussing real parallels? People have often proposed that there are striking parallels between the processes of historical linguistics — language evolution, or language history, as I prefer to call it — and biological evolution, and they line things up like this: regarding the unit of replication we have the gene versus the word, asexual and sexual reproduction versus learning, cladogenesis versus language split, and so on. But if we really carefully review the similarities, we need to ask ourselves whether these are things that we want to see or things that we actually see, because if one only looks long enough at two really different objects, one always finds a way to unite them and to make them similar. And here the point is: there are also many differences between species and languages. Regarding the domain, with biological objects we are talking about Popper's World 1, while with languages we are talking about Popper's World 3, the world of ideas — this paper by Popper is really interesting in this context; I cannot go into the details here, but it is from 1978, I think: "Three Worlds", the Tanner Lectures on Human Values; I think Ray mentioned that language qualifies as Popper's World 3. Then, mechanical versus arbitrary: regarding the relation between form and function, Saussure told us that the form-function relation in language is arbitrary, while biology is mechanical — if you have a gene that codes for a protein, it is a mechanical process that produces something else. Especially important are also sequence similarities: universal versus language-specific. Words are similar only with respect to particular languages, so a general, language-independent similarity between words would not tell us anything. And the differentiation of languages is definitely not always tree-like, whereas most species do evolve in a tree-like manner in biology — but not so in language history. The difference in the alphabets is also worth talking about quickly, because it's always interesting, and it's interesting how long it takes to explain this fact to biologists. The alphabets in biology are universal: we have either the four letters of the nucleic acids or the twenty letters of the amino acids — I'm not entirely sure on the details. But in languages, we know that the phoneme system is something that is defined for a particular language, by the distinctions inside that language, so it is language-specific. And while the alphabet in biology is limited in size, in our case the size of the alphabet varies: we have languages with 50 phonemes and languages with 120 phonemes, or whatever people come up with in their analyses. And our alphabets are mutable, because they change: we know that when a language changes, certain sounds change, and this means that the alphabet changes. We could even imagine that in the future we will start producing new sounds that have never been used before, or think of a situation where in the past people were using certain sounds that are no longer in use nowadays. You could think of similar ideas in biology, but in languages it is much easier to think of such scenarios.

The difference in the processes brings us to the very point that is the problem with the analyses as people run them nowadays, with the gain and loss of words — what they call the gain and loss of cognate sets. They make an analogy between genes and words, equating what biologists call homologs with what linguists call cognates. But these are quite different. The term homology was coined by Richard Owen, who distinguished the homologue — "the same organ in different animals under every variety of form and function" — from the analogue — "a part or organ in one animal which has the same function as another part or organ in a different animal". Nowadays it commonly denotes a relationship of common descent between any entities, without further specification of the evolutionary scenario. With respect to specific scenarios of common descent, molecular biologists further characterize the relationships between homologous genes by distinguishing orthology, paralogy, and xenology, as they call it. This is how you can visualize it: we have a species here, and then a process of speciation, where this gene is inherited at this point. Now, genes can duplicate: a gene can be duplicated within the genome, and the second copy used for another function, so the copies will slightly change — so we have two copies of the same gene in this species at this point, through the process of duplication. And if we now look at the relation between A and B, biologists call them paralogs: they are not directly related, but related via an intermediate process that they call duplication. Similarly, people realized in the 1950s, when they looked at bacterial evolution, that bacteria evolve so fast that you cannot really explain what is going on there without lateral transfer of genes — otherwise you could not explain how they adapt so fast to new environments. When they discovered that, it meant they needed to define another category of relatedness between genes, and this is what they call xenologs: we have a lateral transfer going on from here to there, introducing a new gene into this species, and if we look at C, you can easily see that it stands in a xenologous relation to all the others in the extant species. In my dissertation in 2014 I tried to compare this directly and to add corresponding terms for linguistics, because I thought it was a good idea: direct versus indirect inheritance, involving lateral transfer or not — orthology, paralogy, xenology. In linguistics, when we talk about cognate words, we usually mean both: we don't make a distinction as to whether there is some further process changing the word, like morphological change or semantic change; we don't care about these cases. Some people speak of something like an "indirect cognate" relation, but it is only in the handbook by Trask that I found this. So we have many empty spots in the terminology, and we have no established way to denote all the words that are historically related. What I proposed back then was some new terminology, but this is actually not really important, because later, when preparing the paper, I realized it is not that easy, it is not that simple. An etymological relation, a direct cognate relation — what does it mean to be direct? Does it include semantic shift or not? So what is actually better — and what was published recently in the Journal of Language Evolution — is to just look at these three aspects of lexical change: stratic, morphological, semantic, the three dimensions of lexical change by Gévaudan, as he calls them, and then look at where we can have variation. What people actually use is the cognate relation of Swadesh — he is the influential person for the modern applications, because although people don't use his statistical methodology, they use his methodology to assemble the data in the first instance. In Swadesh's terms, he did not care for morphological unity: he said it can be different or it can be the same — if I still have a little morpheme that reflects the old word, I will say it is cognate with the other word, even if this is not fully visible. But he did care for semantic unity: everything that differs in semantics, he said, I won't count in the lexicostatistics and glottochronology. Swadesh was also against borrowing, so they tried to rule out borrowings. The more traditional notion of cognacy would actually say: we don't even care about semantic shift. If we try to make the comparison with biology, the direct cognate relation would be orthology — completely direct, no borrowing, no morphological change; this would give us what the people in biology call orthology, really clear-cut, without any variation, without any processes, only inheritance. And the oblique cognate relation would be paralogy. With three dimensions and three options each, we actually have 3 × 3 × 3 = 27 possibilities here, by putting the crosses anywhere. But the real problem is: how do we figure out which process was going on when we compare two or more words across languages? And this is the first attempt, where I will control for semantics in this
approach but I will try to show how we can actually get more out of using morphology in the data but let us just another example for the different processes I showed that again I showed that before the example of inter-European soul and son but you can also look at cases like inter-European where you have do as a root for to give as an interverb and don't know is what is given and this root would be only one would assume that this goes back to inter-European which uses for that was is given the present and in Latin we have donom and data as two words and we have donare which makes a verb out of the noun dono but in Latin speakers they did not know that these words are related so they did not know that data and dono are related going back to the same root but it's strictly speaking it goes back so the words are cognate in this sense and then if we look further at the processes we see that French actually used donare in order to make the new word today to give so it's donne and in Italian we have data which is to give so we see that there is a clear process of replacement going on because the French people would really deliberately choose that they do not discontinue this use of the word data to express something that means to give but use another word for that a semantic shift going on there from the word and this is not captured in the current databases because the databases would usually say that all these words are cognate which is just problematic our last aspect is difference in the processes in semantic change when the people apply these methods they say that semantic change can more or less be handled we can just ignore it because we can look only at concept in the same meaning so we just use them and tabulate the data and then have the algorithms decide what was the tree but we can actually imagine that the situation like that we have time point one and we have something like the peeping tom so we look through the house and try to see what is going on in the 
windows, and we hope that there is some light on inside. We see words like 'hand' and 'day' colored in different colors, and also 'arm' and 'meat'. Now we look at time point two, and we see that 'hand' and 'arm' are in the same color now, and the colors of 'day' and 'meat' have changed as well. We do not know what is going on inside, and this is actually the approach that is currently taken, because we only look at certain semantic slots. What we should do instead is unite the slots and look at what was going on in between, because if we open the house, we might find the processes that actually lead to this: 'sun' might replace the word for 'day', 'hand' and 'arm' might merge and come to mean the same, as in the Slavic languages, or 'meat' and 'animal' might merge. We find all these cases in languages all over the world. So the current approach is like looking through windows, really looking through a glass, darkly, and then trying to figure out what was going on, while in linguistics we have the possibility to look closer at the data and at the processes themselves.

So let us shift the paradigm: lexical change in the Chinese dialects. If you look at German, English, Spanish, and Swedish, we find that the words for 'moon' are easy to handle: I can align them here in order to show the sound correspondences, and it is a clear-cut case, the words are all cognate, no problem. But now look at the words for 'moon' in four Chinese dialects: Fúzhōu, Méixiàn, Guǎngzhōu, and Běijīng. In Fúzhōu and Méixiàn we find all these cases of variation, and if we align the words, it is even impossible to align them properly, because the alignment only makes sense for the first part: all the words contain a morpheme meaning 'moon', while the remaining morphemes mean something like 'shine' or 'gleam' or other things.
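The alignment problem just sketched can be made concrete. Below is a minimal global alignment sketch, plain Needleman-Wunsch with flat match, mismatch, and gap scores; real analyses would use weighted sound-class scores, for instance as implemented in tools like LingPy, and the romanizations here are simplified placeholders. Aligning two fully cognate forms works fine, while aligning a monomorphemic word with a compound forces gaps over material that is not cognate at all.

```python
# Minimal Needleman-Wunsch alignment for sequences of sounds or syllables.
# A sketch only: flat scores, no sound classes, placeholder romanizations.

def nw_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return the aligned pair."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = S[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            S[i][j] = max(d, S[i-1][j] + gap, S[i][j-1] + gap)
    # traceback from the bottom-right corner
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                S[i][j] == S[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j-1]); j -= 1
    return out_a[::-1], out_b[::-1]

# Aligning cognate Germanic forms is unproblematic ...
print(nw_align(list("mond"), list("moon")))
# ... but aligning a monomorphemic word with a compound forces a gap
# over the whole second morpheme, which is not cognate material:
print(nw_align(["yue"], ["yue", "guang"]))  # (['yue', '-'], ['yue', 'guang'])
```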
A little statistics: using the data by Ben Hamed and Wang for Chinese, we find that 30% of all words have this structure, and 50% of all nouns. So we have a clear-cut problem here, because people say: let us identify the cognates, put them into an algorithm, and then construct a tree, and we see that it does not really work like that when 50% of the words exhibit this partial cognacy. If we have words of this structure alongside words of the other structure, we cannot really model them in the same application, even apart from the problems that I just mentioned for Indo-European itself; in the Chinese dialects it is really a problem.

Furthermore, the patterns of similarity between the words could give us hints regarding the development of the dialects, and in fact they give us concrete hints. Fúzhōu, we know, is a Mǐn dialect; it was the first to branch off the Chinese tree, if we think of a tree, or of the parts that diverged in a tree-like way, and it is the most archaic. Then there was an innovation, which I think would be this new pattern here; this would then be a later process in the Méixiàn Hakka case; this was then discontinued in the Mandarin dialects; and what we see in Guǎngzhōu is a clear-cut borrowing, because this is a recent word: if one looks at older records of the language, from about 50 years ago, it is already not the same. So all these processes are in the data: by looking at the different patterns of partial cognacy, we can infer innovations, losses, borrowings, and cases like this. I will jump over this slide, which shows one way one could handle this in biology. So, to the main point: we cannot model partial cognacy sufficiently when restricting our analysis to binary gain-loss models.
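The figures above can be operationalized roughly as follows. This sketch invents a tiny dataset with morpheme-level cognate IDs; the IDs and dialect forms are made up for illustration, while the 30% and 50% figures come from the Ben Hamed and Wang data. A concept slot counts as partially cognate when two of its words share some, but not all, of their morpheme-level cognate classes.

```python
# Counting partial cognates in a toy dataset. Each word is a tuple of
# morpheme-level cognate IDs; the IDs below are invented for illustration.

words = {
    ("moon", "Fuzhou"):    (1, 2),   # moon + shine
    ("moon", "Meixian"):   (1, 3),   # moon + light
    ("moon", "Guangzhou"): (1, 4),   # recent formation
    ("moon", "Beijing"):   (1, 5),   # moon + suffix
    ("sun",  "Fuzhou"):    (6,),
    ("sun",  "Beijing"):   (6,),
}

def is_partial(concept, data):
    """True if two words for the concept share some, but not all,
    morpheme-level cognate IDs."""
    forms = [set(m) for (c, _), m in data.items() if c == concept]
    for i in range(len(forms)):
        for j in range(i + 1, len(forms)):
            if (forms[i] & forms[j]) and forms[i] != forms[j]:
                return True
    return False

concepts = {c for c, _ in words}
partial = {c for c in concepts if is_partial(c, words)}
print(partial)                        # only the 'moon' slot is partial
print(len(partial) / len(concepts))   # share of partially cognate slots
```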
Such models are common in Bayesian phylogenetic analyses, but partial cognacy is too frequent to be ignored, not only in Sino-Tibetan languages but also in many other language families: Austroasiatic, Hmong-Mien, Tai-Kadai. If we define binary cognacy on the basis of common morphemes, the majority of the items in our datasets will become cognate, and we will lose a great deal of phylogenetic signal. If we define binary cognacy on the basis of identical morphemes in all words, the majority of the items in our datasets will become non-cognate, and we will again lose a great deal of phylogenetic signal.

Now look at this case, where 'A' means that a morpheme of the word belongs to cognate class A, so two words showing 'A' share a cognate morpheme. We have the same structure as illustrated for the Chinese dialects: A A here, A C there, and B B here. If we model these patterns by saying that each one is different from the others, we give them four colors, and they give us zero phylogenetic information: the phylogenetic analysis cannot tell what the original word was, we know nothing about that. If we do the opposite and say that whenever two words share one element they are cognate, then they all get the same color, and again we have no phylogenetic information, because we do not know which concrete forms are used in which language, which is the interesting question.

So, when dealing with language families in which compounding and morphological derivation are so frequent that they cover more than 30% of the basic vocabulary, we need to incorporate partial cognacy into our phylogenetic models. And this is how we do it: we use a different way to represent the characters, namely multi-state parsimony. People have criticized parsimony, saying that the Bayesian approaches are better, but in this case the data is so sparse that we cannot really use them, and this was more about proving a point.
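The loss of signal under both binary codings can be demonstrated in a few lines. The dialect names and morpheme structures below are illustrative only: under loose coding (any shared morpheme makes two words cognate) all taxa collapse into one class, and under strict coding (complete identity of morpheme structure) every taxon gets its own class, so neither coding leaves any phylogenetic signal.

```python
# Two binary codings of the same morpheme structures, both uninformative.

def loose(forms):
    """Words sharing any morpheme end up in the same cognate class."""
    labels, classes = {}, []
    for taxon, word in forms.items():
        for k, cls in enumerate(classes):
            if cls & set(word):
                cls |= set(word)
                labels[taxon] = k
                break
        else:
            classes.append(set(word))
            labels[taxon] = len(classes) - 1
    return labels

def strict(forms):
    """Only identical morpheme structures share a cognate class."""
    seen = {}
    return {t: seen.setdefault(w, len(seen)) for t, w in forms.items()}

forms = {
    "Fuzhou":  ("A",),
    "Meixian": ("A", "C"),
    "Beijing": ("A", "B"),
    "Taiyuan": ("A", "B", "D"),
}
print(loose(forms))   # all taxa in one class: no phylogenetic signal
print(strict(forms))  # every taxon its own class: no signal either
```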
Multi-state means essentially that every word that differs in its morphological structure, as displayed here, is modeled as one state in the same concept slot. So we have a state AC, and we can then define transitions between these states. We can say that getting from AC to ABD is really difficult, because we need to lose the C and add the B and the D; but if we have an A, getting to AC is rather simple; and going from AB to ABD, as in the blue case, is even easier, because we only add one element, and most of the time this is a suffix, since, as we know, the Chinese dialects usually add something there. We can actually model this, and if we use it to find the most parsimonious tree, there is only one solution, and it gives us back what we know to be true in this case, because the Middle Chinese, or Old Chinese, word for 'moon' was just this one element: it did not consist of multiple morphemes, it did not have this compound structure.

So we can use these penalties, and I have a way to compute them automatically, because nobody knows how compounding evolves; whether there are patterns of compounding that are universal across languages is, I would say, an open question in historical linguistics. But for the Chinese case we can more or less make some assumptions: we know that A can become AB, that we can add a suffix or lose a suffix, but usually we would assume that adding elements is more common than losing them. Then we can actually make a model that compares two different states of the same concept, two different morpheme structures, and says in which direction the change would go.
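A minimal sketch of how such directed penalties can drive ancestral-state reconstruction is Sankoff's small-parsimony algorithm on a fixed tree. The tree, the morpheme structures, and the cost function below (adding a morpheme costs 1, losing one costs 2, replacing unrelated material costs 5 per morpheme) are invented for illustration and are not the penalties of the published analysis.

```python
# Sankoff's small-parsimony algorithm with directed transition costs
# between morpheme structures. Costs and data are illustrative only.

INF = float("inf")

def cost(src, dst):
    """Directed cost of changing morpheme structure src into dst."""
    s, d = set(src), set(dst)
    if not (s & d):                    # no shared morpheme: replacement
        return 5 * max(len(s), len(d))
    return 1 * len(d - s) + 2 * len(s - d)   # additions cheap, losses dearer

def sankoff(tree, leaf_states, states):
    """Per-state minimal costs at the root of a nested-tuple tree."""
    if isinstance(tree, str):          # leaf: observed state has cost 0
        return {s: (0 if s == leaf_states[tree] else INF) for s in states}
    left, right = (sankoff(child, leaf_states, states) for child in tree)
    return {
        s: min(cost(s, t) + left[t] for t in states)
         + min(cost(s, t) + right[t] for t in states)
        for s in states
    }

# A toy reference tree and observed 'moon' structures:
tree = (("Fuzhou", "Meixian"), ("Guangzhou", "Beijing"))
leaves = {
    "Fuzhou":    ("A", "B"),
    "Meixian":   ("A", "C"),
    "Guangzhou": ("E",),       # unrelated recent borrowing
    "Beijing":   ("A", "D"),
}
states = [("A",), ("A", "B"), ("A", "C"), ("A", "D"), ("E",)]
root = sankoff(tree, leaves, states)
best = min(root, key=root.get)
print(best)   # ('A',): the single-morpheme structure wins at the root
```

With these asymmetric costs, the single-morpheme structure comes out cheapest at the root, mirroring the monomorphemic Old Chinese word for 'moon'.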
By using that, we end up, for the pattern that I just showed, with only one possibility, where we had four possibilities, even more than that, with unordered states, and still two possibilities if we add transitions but do not give them a direction. So adding transitions reduces the tree space, and adding directions reduces the tree space even more.

With that we are already close to the analysis that I then carried out. The analysis was based on modeling the data in this way, and I used: 22 Chinese dialect varieties with Chinese character readings, provided by the dataset of Ben Hamed and Wang; 57 nouns, selected so that the Old Chinese form expressing the concept is known to us and still attested in at least one dialect; three reference phylogenies, one by Laurent Sagart, who made a proposal on how the Chinese dialects evolved, one by Jerry Norman, and one from Yóu 1992 (I do not remember his full name); and four models of lexical change, binary, unordered, ordered, and directed, in order to compare them.

Now we have the different phylogenies; it is not that easy to see, but this one, for example, is 'Southern Chinese'. I then compared how often the analyses actually give us what we know, because we have Old Chinese as an external witness: we can say that the Old Chinese word for 'moon' is this word, and the algorithm, given a tree, spits out which word was used in Old Chinese for 'moon'. So we can compare: it is a hit if the algorithm produces the same word, and a fail if it produces another word, and by that we can compare the analyses. This is what you can see happening here: the tree by Jerry Norman gives us something like 0.79 with the directed approach, while the other approaches come close to random. So we find quite something here, but the scores go even higher with the better trees, and the reference tree turns out to be a good indicator that helps us reach better scores.
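The hit/fail evaluation itself is simple to state in code. The forms below are placeholder transcriptions, not the actual study data: for each concept, the word the algorithm reconstructs at the root is compared with the attested Old Chinese word, and the hit rate is the share of matches.

```python
# Hit/fail evaluation against Old Chinese as an external witness.
# The transcriptions are invented placeholders for illustration.

gold = {"moon": "ngjwot", "sun": "nyit", "egg": "lwan"}       # attested
predicted = {"moon": "ngjwot", "sun": "nyit", "egg": "tan"}   # model output

hits = sum(predicted[c] == gold[c] for c in gold)
print(hits / len(gold))   # hit rate: 2 of 3 concepts correct here
```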
But the directed models largely outperform all the rest: in this case we have 82% of hits versus 18% of fails, as opposed to 76% and 55% if you use the simple approach that is currently standard in phylogenetic reconstruction. So I think this is evidence enough that we should be really careful when modeling the data with binary gain-loss models, because we lose just too much information, and by adding more information we can maybe also improve the models for phylogenetic reconstruction.

Now the last point, which I find interesting: I can actually use this to show scenarios of how the characters might have evolved. Here you have the different formations, like 'moon', 'moon mother', 'moonlight'; these are the motivations that make up the word, because in many Chinese dialects they say 'moon mother' or 'moon father' or something like that for 'moon', or 'moon bright', 'moon shine', 'moon light', or 'moon' plus a suffix. And the algorithm says: in the ancestral language we have the blue state, the blue state goes on here, then we have an innovation, this goes on up to the 'moonshine' innovation here, and this innovation occurs two times. Do not ask me why; it may be borrowing, it may be something else, but this is something to figure out afterwards: the result can be handed to experts, and experts can discuss what we find. The pattern that we see here is the scenario that is optimal in the sense of requiring the least number of steps, of processes going on, while explaining most of the data.

There is an interactive application of this, and I will just click on it, hoping that I have an internet connection. Yes: in the interactive application you can for example see that we have many different
scenarios for a single word, but this is just how it works in phylogenetic analysis: for 20 taxa there are astronomically many possible trees, about 8 × 10^21 rooted binary ones, more than I could ever enumerate, and then think of all the possibilities for the evolution of one word along a tree into different patterns. But the main point is that we can look at concrete scenarios. Here I have two possibilities for the development of the word for 'egg', which is reconstructed back to the older form, and if I click on scenario number two, we can see on the tree what happened. The Chinese characters show where the word was retained; whenever there is an innovation, I mark it with a double circle around it. So we have the first innovation going on here, in the ancestor of the Mandarin dialects, here including the Gàn and Xiāng dialects, where they shift from luǎn to dàn, which is whatever you have today in Mandarin Chinese, as in jīdàn 'chicken egg'. What I think is important about this application is that linguists can use it and look at it, and maybe criticize the approach and say that I took some bad numbers or some bad data, but maybe also learn more about the patterns of lexical change in the Chinese dialects or in other dialects.

So I think we are here now, close to the end. The outlook is only the typical blah blah blah, because one always needs to have this: the future is great, we need to be nice to each other, and we also need to save the environment. Thank you for your attention.

[Question] You are not doing it in IPA, you are doing characters; does that assume that an etymologically related word always corresponds to the same character? [Answer] Yes. This is actually what Gévaudan says in his book on lexical change, which I find really interesting, though it takes some time to work through. He says: I ignore
sound change, because sound change is something that just happens to words, and it is regular; so by ignoring it when looking at lexical change, the continuity between the words is given, no matter whether they are pronounced differently now. I think it is a good way to put it: not to look at that aspect. Of course one could add sound change as a fourth dimension, but I think Gévaudan has a good point: let us call only this lexical change, and the other thing is sound change. But you are right that in this case what is ignored are all ideas about patterning, about whether the words have aspirated certain parts or have undergone other sound changes.

[Question] What about what a word means? I guess we sort of touched on this with 'egg', but if you take a form that has changed in Mandarin, and you code it through the character, is that a mistake? [Answer] If I understand you correctly, this is a coding problem: the problem of coding by means of the běnzì, the 'original characters', in Chinese dialectology. I assume there may also be such cases in the data. I took the data as it was; well, I checked the data and corrected certain parts, because it was not always explicit, in order to have a consistent way of coding, but it is clear that what we also need is a careful analysis of the Chinese dialect data in terms of these connections.

[Question] So where does the coding come from? [Answer] More or less from Wang Feng; he made it and published it in 2004, but since then it has not been available online. He really went through the dialects himself: he wanted to have a Swadesh list for the dialects, so he looked at 20 dialects and coded them himself for cognacy. [Question] So it is not from that source? [Answer] No, it is from Wang Feng, and Wang Feng followed the tradition. I think he is actually well informed, because he would overwrite certain judgments where you have different characters in terms of běnzì
and he would say that these are still related; I found certain cases where he does that. But the question is also to what degree this follows the morphemic level, which would be compatible with the běnzì approach. [Question continues] [Answer] More or less, but of course this is important: if the data contains errors, an algorithm can be as shiny as it wants; if the data does not work, the results will not work either. I also have my suspicions: I think Wang Feng did not even try to tag borrowings, but there are many things going on there that were laterally transferred. There are some inconsistencies; when a pattern evolves two times in exactly the same way, I think that is rather unlikely, and it looks more like there are borrowings in the data. And unfortunately he only gave the characters, so I could not even check the sounds. If you see the original data, it is a bit difficult. I understand that this was not the purpose of his work, but nowadays, I think, it would be better to do it differently.

[Question] Is it possible to examine structural things? We are working on a more morphosyntactic level, looking at certain structures and exploring to what extent phylogenetic and other approaches could be used. [Answer] I think in these cases it is really important to ask whether we are really talking about cognacy or about independent development, and we have many more instances where we just do not know. People use grammatical data, usually from typological databases, and say: if we have SVO in this language and in that language, it is potentially 'cognate'. But is it really cognacy? Does it really go back to the same development? This is a problem that one always has with structural data. I think what is really also needed, the main idea
here, is to say: we linguists have some idea about the directions of processes, and directions give us valuable information. So if one has an idea about morphosyntactic patterns, for example that a certain pattern can only evolve out of a certain predecessor, that change can only go in this direction, then this should not be thrown out. The computational people usually say: no, let the data decide, the data will decide everything. Well, the algorithms also need to be guided by us. If I buy one of these automatic vacuum cleaners, put it into the wrong room, and then complain that it did not clean up here, that is my fault: why not open all the doors for it? So one needs to state carefully what one assumes.

In terms of modeling, what is important for all structural features is: what does one think is actually going on when looking at the patterns? Here, in the Chinese case, I already have my problems. I look at the pattern, which I like to show, but then, what does it mean? I asked some linguists in Chinese linguistics: do you have an idea when you see these patterns? And we simply do not have an idea of how the change of compounds proceeds. Is there such an idea for morphosyntax? Where one does not have an idea, one could maybe use this approach to create hypotheses: for example, fixing the tree, taking a given phylogeny, and then looking at what the algorithm finds as patterns and asking whether it makes sense. Because here nobody could tell me; hardly anybody has ever looked at lexical change in terms of compounds, not nobody, but really few people have been doing that, and I think for morphosyntactic patterns and typological features it is the same. Ideally one would also ask linguists, or discuss: do we actually know the direction between these two states, and what does it mean? For example, there is a feature 'small vowel
inventory', 'medium vowel inventory', and 'large vowel inventory': what does it mean in terms of change when you plot it on a tree? People would say that maybe it needs to go from small via medium to large, but then some linguists have their objections against that and always give counterexamples.

[Question] Is this tree based on the intuition of Laurent Sagart? [Answer] In this case, again, I took the trees that are already there, so the point was not to prove that we can reconstruct better trees; I just took the existing trees in order to see how the patterns evolved, to show that we can at least model the evolution of the patterns. The next step would be to use this method, if it turns out to be good, to reconstruct trees itself. But here it is indeed based on Laurent Sagart, who has some ideas about the innovations. So there is no computer behind the tree; the computer only filled out the slots, saying which word was present at which point. [Question] And which tree came up best? [Answer] Laurent's. I keep trying to show that the tree maybe has problems; I should not say that on YouTube, but I am really independent there, and of course I am glad if I find something that confirms the tree. To be honest, so far it is always Laurent's tree that turns out best, and that holds across the different datasets that I have been using.

In another study, what I found had an incredible impact: there are things we cannot handle in trees at all, because we have this phenomenon that in German one would call a Dachsprache, a 'roof language' roofing all the other varieties. When comparing datasets from the 1950s for a dialect, I find, for example, that in Shanghai the word for a certain concept (I do not really get the tones right) is nowadays a Mandarin word; they have been replacing their own words, and you find lots of doublets in the data already. So really, during the last 40
years or 50 years, things have been changing incredibly there, and why should that have been different in the past, say around 1800, when they had different overarching languages, or other languages that were more present? This is a problem, and it also points to the next step: once we can handle lexical change, we could actually search for patterns that show us something that we do not expect. In this case I would, for example, ask what it means that we have a parallel shift going on here in these languages: is the tree wrong? Did these languages each invent it independently? And, as you say, it is actually a normal Mandarin word, or more or less Mandarin, including the Xiāng dialects, so it is more like a Northern dialect word; but there is some influence, so it is quite likely that we have a borrowing going on here.

[Comment, on structural cognacy] There is a paper by Calvert Watkins called something like 'problems and pseudo-problems in syntactic reconstruction', which I found a very nice meditation on inherited structural patterns, which you can do in Indo-European. He says: look at archaic contexts and at abnormal syntax in archaic contexts, something like funeral orations or athletic competitions; these are nice. If there is a syntactic structure that occurs in those contexts in two languages, and which is abnormal synchronically in each language, then you can say that is a good match. I think that is a very nice approach, but it is a lot of work, more work than assembling a set of features and then mapping them over the languages, because you need to look for the irregularities in the languages first. [Answer] I agree with that: if you really want to go for homology, for the cognacy between these elements, then we need to go that way and really decide. And when you use trees, they serve one purpose, and one purpose of trees that is useful for history is this: if we do not know about semantic change, but we have a set of cognates
and we do not know the preferred directions of semantic change, then we can try to have a computer figure out what the most probable solution was, or what general patterns we find there, and I think that is useful. In sound change, most of the time we know the directions: we know that a p becomes an f more likely than an f becomes a p. But for that it is important that the things are already known to be related; we cannot prove any relationship with it.
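The asymmetry just mentioned can be represented as a directed weighting over sound pairs. The numbers below are illustrative weights, not estimated probabilities:

```python
# Directionality in sound change as an asymmetric weighting:
# p > f (lenition) is far more common than f > p. Weights are invented.

direction_weight = {
    ("p", "f"): 0.9,   # lenition: common
    ("f", "p"): 0.1,   # the reverse: rare
}

def more_likely(a, b):
    """Return the preferred direction of change between two sounds."""
    forward = direction_weight.get((a, b), 0.5)
    backward = direction_weight.get((b, a), 0.5)
    return (a, b) if forward >= backward else (b, a)

print(more_likely("p", "f"))   # ('p', 'f'): reconstruct *p, not *f
```

Note that such weights only rank directions among forms already established as cognate; as said above, they cannot prove the relationship itself.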