Thank you very much. Today I'll be talking under this rather boastful title of making computers actually useful to historical linguists. I'd like to start by saying that I'm very happy to be talking in this particular seminar, because I think Joey and I share a very general love for the particular. If we talk about computers and historical linguistics, the first thing that comes to mind, for people who have heard anything about it, is a paradigm like this, what I call tree crunching. You ask specialists of a language group with an already mature tradition of historical linguistics to encode that tradition into a bunch of binary characters — does this language have this sound or not, is this word in this language cognate with that word in that language or not — you feed it all into a phylogenetic algorithm, and you get a tree. And that's nice, but I don't think it's the thing that's most appealing to historical linguists. I'll put my basic misgivings about this paradigm under three subtitles. First, it's kind of inverted: in general, computers are supposed to help us do things, but here the computer runs the algorithm and we encode things so that the computer can read them easily. That's not the happiest arrangement. The second thing is a kind of big-conclusion-ism: the idea that historical linguistics is not an end in itself but a means to a tree, which can then be compared with genetic trees or with large historical movements of peoples, things like that. Of course those things are extremely interesting — I think everyone has spent some time listening to crackpot theories about that kind of thing — but that's not actually what makes historical linguistics fun for us, and it's not what makes the fruits of our labor useful to the people who speak those languages, or who identify with them. And the third thing is that it relies on language groups that already have a tradition, so it doesn't produce anything new for us. So how about having it the other way around? Let's put computers at our service: let's have computers help us create and perfect hypotheses that are classical in kind, that actually interest us, and that are concrete — about this specific trait of this specific language or language group. And we also want to work in an exploratory context, so that computers help us create new traditions, bringing rigorous historical linguistics to language families where it didn't exist before. OK, so the idea of CAPR is that of a reconstruction assistant. What's a word processor? It's what you use to write documents: the document exists inside the word processor, and by interacting with the word processor you write your document. Likewise, CAD is used to create technical drawings and designs, and a proof assistant in mathematics is used to create computer-verified formal proofs. That's the basic idea of a reconstruction assistant: the linguist, or the linguists, interact with it. It's a piece of software that takes as input, basically, the lexicons of the daughter languages, and the assistant helps you gradually elaborate the linguistic reconstruction — your hypothesis about the history of the language group.
And at a certain point, when you think the thing is good enough, it can be output as an etymological dictionary or other kinds of handbooks of the language group. Now, let's admit that computational historical linguistics is not the hardest problem ever for computers: a lot of historical linguistics is just string search-and-replace, and a lot of the rest is just statistics. In principle we should be using computers everywhere. But of course that's not what happened: we basically use computers as word processors, maybe with some custom Ruby scripts or Excel magic to get something done, but it's not computational. And we recently got a couple of great surveys of computational and computerized methods: for every single subtask, for every single step in the general workflow of historical linguistics, there actually exist very good computational and computerized methods. So what do we need? Let's say the technology is already here, and what's needed is engineering. You need a context where you actually want to do something, and you actually get involved and see what you can create. That's the basic idea. OK, so let's talk about some previous approaches from an engineering point of view, which is to say: how can they be made to produce concrete results of good quality? Let's focus on one of the most important tasks in comparative historical linguistics, that of cognate assignment, or cognate determination, or cognacy, or whatever you call it. You have a bunch of words in different languages, and you try to partition them into cognate sets: this word is cognate with that word and with that word, so you put them all together and call it a cognate set. I call it a cogset, but it's the same thing. There are two basic computational approaches to cognate assignment. The first is automatic cognate detection: basically an algorithm, quantitative and more or less language-agnostic; you give it an input wordlist and it tells you which word is cognate with which. The second idea is that of back-projection etymology: you already know the phonological history of the language group pretty well, you encode it into — let's call them programs — that project every daughter-language form back into proto-language forms, and then you criss-cross, you triangulate the results together and see what can be fitted into cogsets with what. Automatic cognate detection has a very long history. The earliest approaches are actually pre-computer: you devise a kind of procedure for three or four research assistants — they still had research assistants in that era — and you have them crunch words like a computer, in a fixed manner, so you gain a rough first idea of what the cogsets could be. As soon as the thing gets computerized and picks up actual engineering constraints, the basic methods rely on raw phonetic similarity, which is a kind of no-no for historical linguists, because it's impressionistic: bad means bad.
So yes, Persian bad looks like English bad — of course it isn't cognate, the word just happens to look like this — and algorithms of that kind will pick up such coincidental look-alikes while failing on non-obvious cognates. What we actually use here is the method developed by Johann-Mattis List, LexStat, which uses pairwise sequence alignment to closely mimic the behavior of the comparative method. Basically, it takes words that might be cognate, assumes they are cognate, computes a correspondence score from the wordlist itself, and uses that score to drive the search. This self-referential procedure is very robust, and most importantly it fixes the over-reliance on raw sound values; its behavior is robustly language-independent. The algorithms, especially the newer incarnations, are great: on something like a German–English–Dutch–French dataset you get 94–95% accuracy. But automatic cognate detection cannot function as a whole workflow by itself. The question is what happens afterwards: when you get a bunch of cogsets, how are you supposed to correct them? This problem has two dimensions. The first is, of course, our goal of exploring new language groups. When you develop such an algorithm, you imagine trained specialists who have the magical power to tell true cognates from false ones. But that magical power is downstream of a very, very mature tradition of historical linguistics. So if we don't think deeply about this kind of problem, we end up having computers help people do things they can already do, while failing to help people do the things they can't. The second dimension is what I'll call the Chinese typewriter problem: even with linguistic expertise, it's quite non-trivial to correct cognate judgments. Here is a logical inventory of cognate judgment errors: over-lumping, over-splitting, and misattribution. (I tried to phrase it in terms of over-correction, or Type I and Type II errors, but it doesn't work like that.) Over-lumping is when the ground truth has, say, three cognate sets, but the algorithm tells you they all belong to one cogset. So for over-lumping you need to split one cogset into several; for over-splitting you need to merge several cogsets together; and for misattribution you need to move forms between cogsets — here's a toy sketch of these three corrective operations.
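To make the three operations concrete, here is a minimal sketch, not the actual CAPR code: a cogset is modeled as a set of (language, form) pairs keyed by an ID, and every form below is invented for illustration.

```python
# Toy model of cogsets and the three corrective operations.
cogsets = {
    1: {("atsi", "wa"), ("bola", "va"), ("maru", "vo")},
    2: {("atsi", "mo"), ("lashi", "mau")},
}

def split(cogsets, old_id, forms, new_id):
    """Fix over-lumping: carve `forms` out of one cogset into a new one."""
    cogsets[old_id] -= set(forms)
    cogsets[new_id] = set(forms)

def merge(cogsets, ids, target_id):
    """Fix over-splitting: fuse several cogsets into a single one."""
    cogsets[target_id] = set().union(*(cogsets.pop(i) for i in ids))

def move(cogsets, form, src_id, dst_id):
    """Fix misattribution: transfer one form between cogsets."""
    cogsets[src_id].remove(form)
    cogsets[dst_id].add(form)

move(cogsets, ("maru", "vo"), 1, 2)  # correct one misattributed form
print(cogsets)
```

The operations themselves are trivial; the hard part, as the next analogy shows, is knowing which of the thousands of machine-generated cogsets they should apply to.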
Now, that's the problem in the abstract. On a small example, of course you know: here are two blue forms, they go over here; here's one red form, it goes over there. But human beings don't have infinite memory. If the items are physically, visually adjacent, it's very easy to see the pattern; but if you actually run LexStat, or any algorithm, on real data, you get a huge pile of cogsets, and it's simply not feasible to reassign forms across 3,000 or 10,000 or more of them. Here's the perfect analogy. I took this picture at one of the libraries in Heidelberg: this marvelous machine called the Chinese typewriter. Apparently you have a mechanical cursor; you locate it over one of the characters, press the key, and it lifts that piece of type to the paper to produce the character. Now imagine this, but where every piece of type is a cogset containing a big bunch of words, and you need to match this cogset here with that cogset over there while looking at all of them at once. It simply isn't going to work. So, as a conclusion: automatic cognate detection is great, it generates excellent first drafts of cognate assignment, but there's no obvious way for the human linguist to correct the result. Let's talk about the second idea, which is crucial to the development of the CAPR methodology: what I call back-projection etymological dictionaries, or back-projection etymology for short. What's the idea? It goes back to Hewson, who, I think in the 70s, had the brilliant idea of producing a computer-generated etymological dictionary of the Algonquian languages. Here's a sample entry from that dictionary: you have the Proto-Algonquian word with its Cree, Fox, Menominee, and Ojibwa reflexes. How can you make this kind of dictionary with a computer? Well, Algonquian is one of the language groups with an extremely developed tradition of historical linguistics: since, I think, the 20s or 30s, people have understood the sound changes very well, even for a quite difficult language like Menominee. So, since you know the historical phonology almost perfectly, you can write programs that back-project. What's back-projection? It's a function that takes an actual daughter-language form and produces the list of possible proto-language forms. Once you have back-projection, you can have the computer go through all the daughter-language dictionaries and unite them: see whether this form can be reconciled with that one in terms of a shared proto-language form. From an engineering point of view, this approach has the unique distinction of actually working — here is the dictionary. And why it works is also easy to explain: of those three kinds of cognate judgment error, here you basically only need to get rid of homophones, which is over-lumping, and that doesn't require matching things together across a 3,000 or 30,000 item set. The problem isn't what happens afterwards — you only need to split cogsets that are already assembled — the problem lies before: a fragile dependency on a perfect historical phonology. Let me make that vivid with an exaggerated example. Take a group X and write some rules, which happen to be quite wrong; the computer compares everything and gives you 15 etyma. And there you have your great etymological dictionary of group X, with its 15 etyma, and you have absolutely no idea what went wrong, because when this method fails, it fails silently. You just have no idea. So it works only in the very limited context of scaling up an already reliable theory of historical phonology into an etymological dictionary.
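Here is a minimal sketch of that reconciliation step. The language names are real Algonquian languages, but every form, every proto-candidate, and the back_project lookup itself are invented stand-ins for the real, hand-encoded sound laws.

```python
# Hewson-style back-projection etymology in miniature.
from functools import reduce

FAKE_SOUND_LAWS = {  # invented data; a real system encodes the sound laws
    ("cree", "nipa"): {"*nepa", "*nipa"},
    ("fox", "nepa"): {"*nepa"},
    ("menominee", "nehpa"): {"*nepa", "*nahpa"},
}

def back_project(lang, form):
    """Daughter form -> set of proto-forms that could regularly yield it."""
    return FAKE_SOUND_LAWS[(lang, form)]

entries = [("cree", "nipa"), ("fox", "nepa"), ("menominee", "nehpa")]

# The forms cohere as one etymon iff their candidate sets share at least
# one proto-form; that shared form heads the dictionary entry.
shared = reduce(set.__and__, (back_project(l, f) for l, f in entries))
print(shared or "cannot be reconciled")  # {'*nepa'}
```

Note the failure mode described above: if any single candidate set is slightly off, the intersection silently comes out empty and the etymon never appears.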
But we're trying to do something else, this exploratory thing: when you have a group you don't really know yet, what can you do about it? So while for the detection algorithms the challenge lies in finding a way to correct the results, here the challenge lies in the need for linguists to discover and gradually refine the historical phonology. That's the basic situation for both. If we look at them from this engineering, workflow point of view, they share the fatal disadvantage of being non-exploratory and non-iterative: you need to already know the group perfectly before you can enlist the computer's help. And of course, if you know the group perfectly, why do you need a computer? So let's distill this discussion into design goals, for something we would actually want to use. The first goal is what I call optimal human–computer cooperation. The ideal is a computer that works like a very intelligent assistant — or rather, the computer is the student and you're the supervisor: the student does everything to the best of its ability and then asks you how to proceed. You do the interesting part, generating hypotheses and giving intuitive judgments, and the computer does the work of enforcing consistency and keeping track of details. That's the ideal. The second design goal is incremental improvement. As we saw with back-projection etymology, it really needs a good hypothesis of historical phonology, which in most cases we don't have. A good system should let hypotheses be gradually built and refined: even a crude set of hypotheses should reward the linguist with more and better information, so that the linguist can use the information gained under the bad hypothesis to make a good hypothesis. And the third is non-destructiveness: you need to think really hard about making sure that when the human linguist does something, it isn't thrown away — except when the human linguists themselves decide it was wrong. Now let's talk about our workflow, this thing called CAPR, a computer-assisted approach to proto-language reconstruction. The context is that, within the framework of the ERC project Beyond Boundaries, we want to make an etymological dictionary of the Burmish languages. We have a lot of digitized data, and right now we're working on finding a way to gracefully produce a first version of the dictionary from, I think, six languages in a certain parallel-lexicon kind of source. And the context is that of monosyllabic languages. Some people think 'monosyllabic language' means a language where most words are monosyllables, which doesn't really exist, I think; but it's still a very important typological type: the morphology is mostly compositional, the productive morpheme boundary coincides with the syllable boundary, and little happens across syllable boundaries, either diachronically or synchronically.
So with monosyllabic languages, basically, this allows us to forget about words and take the syllable as the basic unit of phonological evolution and etymologization. Here's an example of what cognacy looks like for monosyllabic languages: 'brain' is literally head+brain, and 'head hair' is head+hair, where this 'hair' is the body-hair word. So the 'head' syllable here and the 'head' syllable there go into one cogset, the 'brain' syllable goes into another, and the 'hair' syllable gets etymologized on its own, in yet another cogset. That's the basic idea of monosyllabic languages, and it saves us a lot of work. CAPR is developed for monosyllabic languages, with the limitations that implies; if we later want to make it useful for other kinds of languages, we'll have to think harder about it. The actual document that the CAPR workflow has us work on is what I call the bipartite hypothesis, which is made of two parts: the cogsets and the phonological history. The cogsets are just cogsets: a word in the general case, or a syllable in this monosyllabic case, belongs to a cogset — nothing fancy here. The other part is what we call the phonological history. You write it as finite-state transducers in a formal language. I don't have time today to explain how great transducers are, but the idea is that they let you write the historical evolution in SPE-style rules, which everybody more or less knows, while automatically permitting you to project both forwards and backwards. And that's exactly what we need. The phonological history itself is composed of two parts; let's look at an example where Proto-Burmish *bar gives Maru pɔ. First, you need to define what a legal syllable in the proto-language is. Here, say, a legal syllable is composed of an initial and a rhyme; the initials are these, and the rhymes include ar and e. You can see that *bar fits: b goes in the initial slot and ar in the rhyme slot. Nothing fancy. Then you define the individual sound changes that lead from the proto-language to Maru. For example, here's a devoicing rule that turns b into p, and here's a rule that turns ar into ɔ, so that we indeed get pɔ from *bar. And finally, we have this transducer called 'Maru', which defines not just any legal Maru syllable but a perfectly inherited Maru syllable: one that is licitly derived from Proto-Burmish, excluding loans and other shenanigans. How do you define it? You take the proto-phonotactics and apply the sound changes one by one: first the devoicing, then ar > ɔ. So what you get is, essentially, an ordered list of sound changes applied to the legal proto-language syllables.
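Here is a minimal sketch of that composition in code, using the pynini finite-state library. The inventory and the two rules are a toy version of the example just given, not the project's actual grammar, and the API calls reflect pynini as I understand it.

```python
import pynini

# Alphabet for this toy grammar.
sigma = pynini.union("p", "b", "a", "r", "e", "ɔ").closure()

# Proto-phonotactics: a legal proto-syllable is initial + rhyme.
initials = pynini.union("p", "b")
rhymes = pynini.union("a", "ar", "e")
proto_syllable = initials + rhymes

# Ordered SPE-style sound changes as context-dependent rewrite rules.
devoicing = pynini.cdrewrite(pynini.cross("b", "p"), "", "", sigma)
ar_to_o = pynini.cdrewrite(pynini.cross("ar", "ɔ"), "", "", sigma)

# "maru" accepts only perfectly inherited syllables: the proto
# phonotactics composed with each sound change in order.
maru = proto_syllable @ devoicing @ ar_to_o

# Forward projection: *bar > pɔ.
print(pynini.compose("bar", maru).project("output").string())  # pɔ

# Backward projection: all proto-syllables that could yield Maru pɔ.
back = pynini.compose("pɔ", pynini.invert(maru)).project("output")
print(list(back.paths().ostrings()))  # e.g. ['bar', 'par']
```

Notice that the backward run returns two candidates, *bar and *par, since devoicing merges them: back-projection is inherently ambiguous, which is exactly why the set-intersection step described next is needed.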
An interesting point about the CAPR view of the hypothesis — of the data you actually work on — is that you don't write down any proto-forms at all. Looking at this entry, you might think the linguist had to type the proto-form at the top, but it shouldn't work like that: the proto-forms are not provided by the human linguist; they are inferred from the phonological history plus the daughter-language forms. Take this example, which means 'tears': this daughter form back-projects to this whole list of possible Proto-Burmish syllables, that one to those, and that one to those, and then you do a kind of set intersection. In fact it doesn't always work that simply, so there are some complex heuristics for deciding that the actual reconstruction of the whole cogset is likely to be such-and-such. That's how it's supposed to work. Why do we do this? Because it permits rapid iteration on the phonological history. If you do reconstruction by hand, one not very sexy but very annoying problem is this: whenever you change your hypothesis about the proto-language, a lot of the proto-forms need to be replaced. So basically you get lazy and never update your reconstructions, except when you have a completely new vision of how things are supposed to work — and that is quite counterproductive, because, as we'll see later, the best way to make good reconstructions, good hypotheses, is to iterate very quickly on the whole theory. So, we have transducers, we have this phonological history: what are they used for? The first insight of the CAPR methodology is that you just blindly do backward projection wherever possible, and very often forward projection too, and that provides a lot of context to the linguist. For example, take this word for 'cloudy'. There's an inferred reconstruction for the cogset, and for the Atsi syllable the computer runs the transducer backwards and sees that the projected reconstruction of the whole cogset is among its candidates: it accords, it's harmonious, it conforms, so you get a check mark. And you get different kinds of warnings too. For example, this Maru syllable, at least according to our current hypothesis, can only come from a couple of specific proto-syllables; the computer back-projects it and tells you that it does have a reconstruction, but not one that conforms to the projected reconstruction of the cogset — so maybe your phonological history is wrong, or maybe this is an irregular reflex, or a loan. And in this other example, of course, what's wrong is not the phonological theory: you should move the word out and create another cogset. So that's the basic idea of back-projecting everywhere, which gives the linguist a lot of context; we'll take another look at it later.
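As a toy illustration of these check marks and warnings: every candidate set below is invented, and the majority-support rule is a crude stand-in for the more complex heuristics just mentioned.

```python
# Label each form by whether its back-projection candidates include the
# cogset's inferred reconstruction.
from collections import Counter

CANDIDATES = {  # (language, form) -> possible proto-syllables (invented)
    ("atsi", "tsho"): {"*dzuk", "*tsuk"},
    ("bola", "tʃhuʔ"): {"*dzuk", "*djuk"},
    ("maru", "tuk"): {"*duk"},   # reconstructible, but differently
    ("lashi", "puz"): set(),     # no licit proto source at all
}

def infer_reconstruction(cogset):
    """Pick the proto-candidates supported by the most forms."""
    counts = Counter(p for f in cogset for p in CANDIDATES[f])
    if not counts:
        return set()
    best = max(counts.values())
    return {p for p, n in counts.items() if n == best}

def label(cogset):
    proto = infer_reconstruction(cogset)
    for form in cogset:
        cands = CANDIDATES[form]
        if cands & proto:
            print(form, "check: conforms to", cands & proto)
        elif cands:
            print(form, "warning: reconstructible, but not to", proto)
        else:
            print(form, "warning: no licit source -- loan? bad rules?")

label(list(CANDIDATES))
```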
The basic organization of the CAPR workflow is what you would expect: you process the source wordlist into a computer-readable form; then comes the bootstrapping stage, where the LexStat algorithm generates the first draft of the cognate assignment; and then you get iterative improvement, before deciding at some point that you're going to publish the result as an etymological dictionary. The interesting part here is the iterative improvement. The idea is that, since we have this bipartite hypothesis, when you make a better transducer it should somehow contribute to the cogset-editing process to create better cogsets, and vice versa. So you work on transducers, then you work on cogsets with the new material and feel very happy, then you use that to create better transducers and feel very happy again — better and better, until you get the final result. So let's go through this iterative improvement for both parts. The more interesting part is cognate assignment: how do you solve those seemingly intractable problems of creating cogsets? Here is a misattribution error. Drawn like this on a slide, of course, you correct it by dragging things around. And with back-projection, dragging becomes genuinely informative — let's take a look. We just saw this 'cloudy' word; maybe something doesn't really belong here, so let's work on it. We create a new cogset, and it's as easy as dragging the word from the old cogset to the new one. Now watch how the reconstruction actually changes. The word we drag up first has no reconstruction according to our current theory, so the new cogset has no reconstruction yet; then we drag the other word over — this one does have a reconstruction — so it is inferred that the new cogset likely has that reconstruction. That's the basic idea: you let the linguist drag words around, and you use the transducers to provide the most useful kind of context that can be provided. Of course, on the demo board we had just three cogsets; in reality we have a lot of them, and that's a problem that needs to be taken very seriously. Here's what we want. First, partial visibility: you shouldn't let yourself be overwhelmed by an ocean of cogsets; that wouldn't work. Either you prioritize them — some are more important, more reliable, more likely to end up in the etymological lexicon, whatever — or, failing that, just show me some random cogsets, so I don't have to work on 30,000 of them at once. The second idea is grouping: since over-splitting and misattribution are relations between cogsets, we really need the related cogsets to be visible to the human eye on the same screen, in order for a human being to realize that, OK, maybe we should drag this form from here to there, or merge these two cogsets together. So we need a way to group together the cogsets that are likely to stand in an over-splitting or misattribution relation. Here's what we do. First, we have a criterion of reliable cogsets.
The idea is that, since in an etymological dictionary an etymon is usually justified by more than one example, it can do no harm to declare that we only want to see cogsets whose reconstruction is justified by words from more than one language. The second idea is that you take the reliable cogsets and check which other cogsets — reliable or not — could have the same reconstruction. You make an equivalence class: you partition the cogsets into boards, where the cogsets on a board share one or more possible reconstructions. What does that achieve? By doing this with the transducers, you make only a small part of the cogsets visible; and since they're visible in boards, every board contains cogsets that are very likely to stand in an over-splitting or misattribution relation — some of their words already share a possible reconstruction, so they may very well be related. Boards also offer a task-management benefit: they chunk the work into manageable units. Let's look at some lovely boards. Here's a board with a lot of cogsets — but not that many. Here's another. Notice, for example, here is a cogset reconstructed to one proto-form, and here's another, and the glosses are 'all' and 'whole', so they should be merged. Here's a rather large board where we can merge things together. Here's a very good board which contains basically a single etymon; however, one word doesn't look right — it's just a word for 'water', and the algorithm failed to see that it isn't the same as the other words — so let's move it out. And here we get a word whose reconstruction means 'to go to battle'; here you get the same meaning, and the reconstruction, even the superficial form, lets you make the decision; and of course the neighboring gloss is a related idea to 'go to battle', so you can merge all those cogsets together. To be clear, this isn't me really working; these are just the results of the first fishing, so there isn't a lot here yet. OK, that's the basic idea. This process of using transducers to pull cogsets onto visible boards we call fishing. Why fishing? Because with every improvement to the phonological history, to the transducers, more reliable cogsets become visible to us. If you write a first rough transducer for a new language, or improve the transducer of a language that already has one, then either more cogsets become reliable, or some small cogsets acquire a reconstruction and can be attached to a reliable cogset's board. That's the basic idea, and we have a nice little fish icon to mark the cogsets that have just been fished up.
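Here's a toy sketch of the fishing criterion just described. The cogset IDs, candidate reconstructions, and support counts are invented, and for simplicity each reliable cogset spawns its own board rather than a proper equivalence-class partition.

```python
# Fishing in miniature: boards = cogsets sharing a possible
# reconstruction with a reliable (multi-language) cogset.
from collections import defaultdict

RECONS = {                       # candidate reconstructions per cogset
    "cogset-1": {"*ban", "*pan"},
    "cogset-2": {"*ban"},        # shares *ban with cogset-1
    "cogset-3": {"*duk"},
    "cogset-4": set(),           # nothing reconstructible yet: invisible
}
SUPPORT = {"cogset-1": 3, "cogset-2": 1, "cogset-3": 2, "cogset-4": 0}

def fish(recons, support):
    by_proto = defaultdict(set)
    for cs, protos in recons.items():
        for p in protos:
            by_proto[p].add(cs)
    boards = []
    for cs, n_languages in support.items():
        if n_languages > 1:      # reliable: supported by > 1 language
            board = set()
            for p in recons[cs]:
                board |= by_proto[p]
            boards.append(board)
    return boards

print(fish(RECONS, SUPPORT))
# e.g. [{'cogset-1', 'cogset-2'}, {'cogset-3'}]; cogset-4 stays unseen
```

Improving a transducer can give cogset-4 a reconstruction, or raise another cogset's support above one language, and more cogsets get fished onto boards.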
Here's how fishing fits into the whole workflow, color-coded. The gray box represents the result of the LexStat algorithm: a huge pile of possible cogsets waiting to be corrected. Then — I'm being very hand-wavy here — you bootstrap yourself up and put together a rough first draft of the historical phonology. With fishing, you fish a lot of cogsets out into this red portion, which I call visible, or boarded. Then, with the cognate-assignment interface — the one we just saw a lot of videos of — the human linguist edits this first batch of boarded cogsets into human-curated, high-quality cogsets. From those you get better hypotheses of historical phonology, you fish out more cogsets with that better phonology, and you edit them again in the interface. That's the idea: with this fishing mechanism you finally have an incremental way out of the conundrum of too many cogsets. Now, here's an interesting wrinkle I want to mention. Because your first transducers are not perfect, they will pick up some things that the fishing algorithm, running with later transducers, decides shouldn't be on the board after all. Those get assigned this cute dead-fish icon. And if we take an actual look at the dead-fish items, it's very interesting. For example, this one means 'tile' — the tile you put on the rooftops of houses. The first thing you notice is that the forms look a little too similar for actual cognates: they're all basically wa, va, va. The second thing is that they receive quite disparate reconstructions: this one is wa with one tone, that one is wa with a different tone, and this one has no reconstruction at all — it doesn't look like an inherited syllable. So what happened? If you speak Chinese, you know: this is basically the Chinese word for 'tile'. So you see that using the algorithms and the transducers together also weeds out the loans, if what you're interested in is the inherited part of the lexicon. OK, so here's the idea: we're building an iterative workflow that lets you work on either of the two parts of the bipartite hypothesis. When you get better transducers, they help you fish cogsets, and the transducers also contribute directly to the UI, so you are helped — spoiled, even — by the transducers into creating better cogsets, which in turn can be used to create better transducers. This interface and this fishing workflow basically combine the two approaches we discussed before: automatic cognate detection and back-projection etymology. If you do math or physics, you know the idea of one object degenerating into another in a limiting case. Imagine the cognate-detection algorithm declaring that no word is cognate with any other word: the workflow then degenerates into pure back-projection etymology. So it's basically both at once, and of course the interesting thing is the combination. It inherits the advantages of both methods, especially with regard to the scope of the data: automatic cognate detection isn't fazed by small imperfections, but on the other hand it misses the non-obvious cognates, the kind that give you an aha.
I remember learning — I think only last year, which is ridiculous, that I hadn't known it before — that English enough is cognate with German genug. It's trivial once you know the historical phonology, but otherwise it doesn't click, and it won't click for the algorithm either. Back-projection etymology gives you those ahas, but if there's any little imperfection, either in the actual forms or in your current version of the historical phonology, the word simply doesn't appear. So on the one hand, we don't want this fastidiousness making data invisible; on the other hand, we do want to know when a modern reflex cannot be regular. The CAPR workflow, by putting the human linguist at the center, provides both kinds of information: the kind the detection algorithms give you, and the kind the strict, rigorous back-projection etymology gives you — and it lets you decide what to do with it. OK, now let's talk about transducers. The thing with transducers is this: for cogsets you have the algorithm creating them and you merely correct them, but the transducers you need to create ex nihilo, which makes the situation quite frustrating. So here's what we want. First, to create your historical phonology you need something to work from: you need the words you think might be cognate with each other, and you need a visualization mechanism to show you the patterns, to make you think about things. And since transducers — or any formalized hypothesis of historical phonology — are unmanageable, fragile juggernauts, you also need some way to decompose the work into discrete tasks of manageable size, with feedback, so you know whether what you did was good or bad. OK. So, we work from correspondence patterns. The old masters of historical linguistics held, of course, that the correspondences are the real thing, and that the reconstructions — Meillet called them 'restitutions' — are not real, just a kind of algebraic notation for the real correspondence patterns. It doesn't entirely work like that, but it gives us a pretty good idea of what's important in the whole strategy. In the scholarly context we come from — the linguistics of China and Southeast Asia, of Sino-Tibetan languages — there's a tradition of putting the actual attested forms into correspondence-pattern tables. Here's an example from a reconstruction of Proto-Miao: you get this proto-initial, and — the author tries to be exhaustive — all the words showing it are put here, so you can see them correspond to one another. And here's a less exhaustive example, from a reconstruction of Proto-Tai: four languages and five words, but the basic idea is the same. There's also the practice of explicitly marking non-regular correspondences: here it says the tone isn't regular; there, that the vowel isn't regular.
OK, in the most general case, generating correspondence patterns from cogsets is a genuinely hard problem, because the patterns require alignment: you need to take the individual segments and fit them together into a table, so you can say that Spanish l corresponds to Italian l, which corresponds to Romanian l, which corresponds to French l, and so on. However, two things. First, we don't really need the algebraic expression of the correspondence pattern itself; we need it mostly as a visualization tool. One can argue that a bare statement like 'zero corresponds to t corresponds to p corresponds to zero' completely misses the point. But if you use it not as the product, but as an index to generate those Chinese-and-Southeast-Asianist-style complete tables with examples, it makes the pattern extremely clear: even if the 'zero : t : p : zero' behind it is imperfectly aligned, we look at the table and get a lot of ideas about where the pattern might come from. So the exact alignment is not what matters. And second, at least in the current incarnation of the CAPR methodology we're working with monosyllabic languages, so we do a very trivial thing: alignment by phonotactics. This is phonology like the linguistic theory of 1000 AD: you divide the syllable into initial, medial, nucleus, coda, and tone. For every language, the linguist writes a kind of parser that parses syllables into this schema, and the columns are then combined like this: initial plus medial, then medial, then the rhyme, which is medial plus nucleus plus coda, then tone. The important thing is that you don't treat alignment as some very refined object, but as a tool to help you see the actual patterns.
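A toy sketch of that alignment by phonotactics: one hypothetical language's parser as a regular expression with named slots. The inventories and the example form are invented.

```python
# Parse a syllable into initial / medial / nucleus / coda / tone slots,
# then derive the columns used in the correspondence charts.
import re

ATSI = re.compile(               # invented toy inventory for "Atsi"
    r"^(?P<initial>tsh|ts|kh|k|m|w|)"   # longest alternatives first
    r"(?P<medial>j|)"
    r"(?P<nucleus>a|o|u)"
    r"(?P<coda>ŋ|k|ʔ|)"
    r"(?P<tone>[0-9]*)$"
)

def parse(form, pattern=ATSI):
    m = pattern.match(form)
    if not m:
        return None              # not a well-formed syllable here
    s = m.groupdict()
    return {                     # derived chart columns
        "onset": s["initial"] + s["medial"],
        "medial": s["medial"],
        "rhyme": s["medial"] + s["nucleus"] + s["coda"],
        "tone": s["tone"],
    }

print(parse("tshaŋ55"))
# {'onset': 'tsh', 'medial': '', 'rhyme': 'aŋ', 'tone': '55'}
```

Stacking the parses of cognate syllables slot by slot gives the crude but perfectly serviceable alignment behind the tables.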
Here's how it works in practice. Say we have these languages, and we're going to study Atsi. The interface is very primitive — but it's me doing the programming, so how could it be otherwise? You ask for the correspondences, you wait a bit for the report, and it gives you chapters — tones, initials, rhymes — and for each rhyme you get the correspondence patterns found everywhere, with the actual forms and also the back-projections. Let's go from two languages to three. We still need to wait a little for the actual results. Here they are. Let's look at rhymes again — there's absolutely no interface, you just search around in this document. And here is a tone correspondence. OK, that's the basic idea. And here's a really great example of how back-projection helps you reconstruct. We see this correspondence pattern across the languages — for 'chicken' and 'cock, hen' (the same etymon; they should be merged, but as is there are basically two examples), and for 'bark', and for 'rain', each with its own reflexes. When you just look at these forms, it's really difficult to think of any single hypothesis that explains them all. But of course, you don't need to. The idea is to enable gradual refinement: you first create hypotheses for the biggest, most regular patterns. And what does that give you? After you've created them, look here: it's now really easy to see a grouping, where some of the languages point to one proto-rhyme while the others point to another. So for those two words, and probably others, you get one reflex in some languages and another in the rest, and from there you can build a more sophisticated hypothesis. And this is marvelous, because in the real world, someone reconstructs a bad proto-language, you read the article, and you write another article saying: how could you be so stupid as to miss this obvious correspondence pattern? But of course, when your whole mind is taken up by the basic, laborious work of collecting correspondence patterns, you can't really see it; here, you can evolve the hypothesis in a truly effortless manner. OK, the next problem is debugging. You have ordered sound changes, with feeding and bleeding and all that; as the Chinese saying goes, you pull one hair and the whole body moves. This in itself would be very difficult, so the idea is to factor the whole change into tiny atomic changes, checked comparatively. I think it's clearest with an example, so let's continue the video. We're still studying Atsi, Bola, and Maru, and now I'm looking for those cross marks. A cross means that, for example, the word here cannot come from any Proto-Burmish etymon at all, according to the present hypothesis of the historical phonology of Maru. So I'll be looking for that kind of thing. I look for crosses, crosses, crosses... and here's a good one. Look: there are two crosses here, so we have a whole correspondence pattern for which the current theory doesn't work for any of the forms. For 'far' and for 'rely', the attested Maru form doesn't correspond to anything, and for 'to measure' it does correspond to something — but you can see that in the other languages everything is fine, so the problem sits with Maru. And what is the problem? You have this vowel *e in Proto-Burmish, which gives a in Atsi and e in Bola — the same vowel everywhere in Atsi and Bola. In Maru, however, sometimes it's e and sometimes it's ar: earlier we had forgotten — we simply didn't see — that some of the *e vowels could become ar. And what's the condition? Look at the initials of these words. The change from the Proto-Burmish word to Maru could be quite recent, so the idea is: maybe the Proto-Burmish vowel *e changes into ar after velars, including the labiovelar approximant w. Now we've found our problem. For demonstration purposes, I'll start by inventing a very bad solution.
So first, here's a sound law that changes e into ar mechanically, without taking the phonetic conditioning into account. And the placement of the sound law is wrong too: we get e > ar here, but there's an already existing sound change in Maru where ar changes into ɔ, so we can predict that our bad rule will turn e into ar, which then gets changed into ɔ — which is not what we want. And what do we get? For the words we actually care about, nothing: 'far' still has no reconstruction, and neither do the others. What did change is elsewhere: forms whose reconstruction was there under the old theory have lost it under the new one. And here we see the beauty of the CAPR interface. You get two rows — the first row corresponds to the old transducer and the second to the new one — and it checks whether your new transducer explains the attested forms better or worse than the old one. In this case it's worse, and since this is a systematic degradation of the quality of the hypothesis, you get a huge swath of red frowning faces across whole correspondence patterns, which is the interface telling you that something went terribly wrong. So now we're reminded of Historical Linguistics 101 — sound laws are ordered, and order has consequences — and we put the rule in a better place. Now we get smiley faces telling us that, OK, the attested form for 'far' is properly reconstructed to the Proto-Burmish etymon — hooray! — but of course we still get a huge swath of red, sad stuff. OK, now let's do it seriously. We remember the conditioning: it happens after velars. What are the velars? k, g; and we remember there's the w, and maybe aspiration. So: e > ar after velars. Wonderful. We run it and — what happened? It did the right thing for 'far' and 'to measure', but not for 'rely'. Why? Because, of course, 'rely' has an initial I had left out of the rule. You see, I meant to get things wrong at the beginning for demonstration purposes, but now the interface has caught me in an actual, unplanned error. OK, here's the right version of the sound law. And — there. You should feel very happy about this kind of outcome: when you get a whole correspondence pattern with green smiley faces, it's a new situation — a sound change that is now properly accounted for where it wasn't before. However, we still need to check that the change doesn't make anything else worse. So — because my interface is very bad — I type the frowning-face emoji into search, and search tells me there is no frowning face anywhere in the document. So now I'm very happy with my new sound law in Maru, where e changes into ar after velars. OK, the discussion. For the transducer-debugging part: first, you get this visualization — the Chinese-and-Southeast-Asianist-style correspondence charts plus forward projections — which provides the necessary and minimal information for generating the correct hypothesis. The next thing is incrementality: it's very dangerous to change transducers, so we have these atomic changes.
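Here's a toy sketch of that two-row old-versus-new report. The glosses, forms, target reconstructions, and the two stand-in back-projection functions are all invented.

```python
# Rerun both versions of a transducer over every relevant form and mark
# per-form improvements and regressions, like the smiley/frowny report.

FORMS = {"far": "var", "rely": "ne", "measure": "kje"}        # gloss -> form
TARGET = {"far": "*war", "rely": "*ner", "measure": "*kjer"}  # cogset recons

# Stand-ins for running the old / new transducer in reverse:
old_bp = lambda f: {"var": set(), "ne": {"*ner"}, "kje": set()}[f]
new_bp = lambda f: {"var": {"*war"}, "ne": {"*ner"}, "kje": {"*kjer"}}[f]

def report(old, new):
    for gloss, form in FORMS.items():
        was_ok = TARGET[gloss] in old(form)
        now_ok = TARGET[gloss] in new(form)
        if now_ok and not was_ok:
            print(f"☺ {gloss}: newly explained by the revised rules")
        elif was_ok and not now_ok:
            print(f"☹ {gloss}: regression -- explained before, not now")
        else:
            print(f"  {gloss}: unchanged")

report(old_bp, new_bp)
# ☺ far ...   rely unchanged   ☺ measure ...
# A bad rule would instead flood the report with ☹ rows, the "swath of
# red" that signals a systematic degradation.
```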
And you get reports with smiley and frowning faces — an immediate feedback loop which ensures that if you're making a trade-off, you know you're making a trade-off; if something is an improvement, you know it's an improvement, and if it isn't, you know it isn't. The third thing is iterativity. What do the cogsets contribute on this side? Well, when you have better cogsets, you get more examples. I think the reason we didn't catch this ar earlier is that there were too few cogsets in the first round of fishing: with one example, you're never really sure whether something needs to be done about it. And once the rubbish has been edited out of the algorithm-generated cogsets, you also get fewer red herrings, and a much clearer picture of the historical phonology. So that's basically it. We've just discussed transducer debugging as part of an iterative workflow: better cogsets generate good correspondence patterns for you, with which you debug your transducers and improve your hypotheses, in an iterative, instant-feedback way. Sorry for this being a bit technical, but I think if we're really going to talk about how to make computers actually useful for historical linguistics, we need to see, from beginning to end, how this workflow can be made consistent with itself, and iterative, and so on. OK, that's it — so, future directions. We're still working on the current incarnation of the CAPR methodology, with a view to producing a really good etymological dictionary of the Burmish languages — and, frankly, I don't program that well. More seriously: we've seen that a lot of things are greatly simplified by the fact that we're working on monosyllabic languages. First, the alignment problem, which is really hard, is simplified away. To address it properly, I think we'll at some point need much better transducers than the current kind — it's like Perl regular expressions versus real regular expressions; as long as we keep the bidirectionality, I think we should explore ways to put more into them. Second, with monosyllabic languages we don't really have to care about paradigmatic morphology or analogy, which destroy the perfectly sound-law-regular relationship between the inflected form in the daughter language and the corresponding form in the proto-language; even encoding such things in a rigorous, machine-readable, machine-verifiable way is an open problem. Let me end this talk by saying some good words about this technology. I think at some point transducers, together with other technologies, can bring us to something that in French we call democratiser — in English, something like de-gatekeeping. For Indo-European linguistics you have an actual discipline, with many people working in it; people can communicate in the tersest, most symbolic way possible, and for them it's workable. But for minority languages, let's be very frank: I don't think
we're getting 300 professors bickering with each other over one tiny language family in the jungles of northern Laos — it doesn't work like that. So what we can do is make our hypotheses transparent, easy to change, and easy to understand, so that a less institutionalized, less intense kind of cooperation can still move us toward a better future for the historical linguistics of the world's languages. OK, that's it. Thank you very much.