Thank you very much. So today I'll be talking under this rather boastful title of making computers actually useful to historical linguists. I'd like to start by saying that I'm very happy to be talking in this particular seminar, because I think Jerry and I share a very general love for the particular. If we talk about computers and historical linguistics, the first thing that comes to mind, for anyone who has heard anything about the field, is a paradigm I call tree crunching. You ask specialists in a language group with an already mature tradition of historical linguistics to encode that tradition into a bunch of binary characters: does this language have this feature or not, is this word in this language cognate with that word in that language or not. You put it all into a phylogenetic algorithm, and you get a tree. And that's nice, but I don't think it's the thing that most appeals to historical linguists. I can put my basic misgivings about this paradigm under three subtitles. First, it's inverted: in general, computers are supposed to help us do things, but here the computer runs the algorithm and we encode things so that the computer can read them easily. That's not the happiest arrangement. The second thing is the "big conclusion" mindset, which is why I began my talk with my love for the particular: the idea that historical linguistics is not an end in itself but a means to a tree, which can then be compared with genetic trees, or with large historical movements of peoples, things like that. Of course, those things are extremely interesting — I think everyone has spent some time listening to crackpot theories about them — but that's not what actually makes historical linguistics fun for us, and it's not what makes the fruits of our labour useful to the people who speak those languages, or who identify with them. And the third thing is that it relies on language groups that already have a tradition, so it produces nothing new for us. So how about having it the other way around? Let's have computers in the service of human beings, and let's have computers help us create and perfect hypotheses that actually interest us: classical, concrete hypotheses about a specific trait of a specific language or language group. And we also want to work in an exploratory context, so that computers help us create new traditions — bring rigorous historical linguistics to places, to language families, where it didn't exist before. Okay, so the idea of CAPR is that of a reconstruction assistant. What's a word processor? It's what you use to write documents: the documents exist inside the word processor, and by interacting with the word processor you write your documents. Likewise, CAD is used for technical drawings and designs, and a proof assistant in mathematics is used to create computer-verified formal proofs. That's the basic idea of a reconstruction assistant: the linguist, or the linguists, interact with the reconstruction assistant.
As software, it takes a bunch of inputs — basically the lexica of the daughter languages — and you gradually elaborate, with the assistant's help, the linguistic reconstruction: your hypothesis about the history of the language group. At a certain point, when you think the thing is good enough, it can be output as an etymological dictionary or other kinds of handbooks of the language group.

Okay. Let's say that computational historical linguistics is not the hardest problem ever for computers. A lot of historical linguistics is just string search-and-replace; a lot of the rest is just statistics. In principle we should be using computers everywhere, but of course that's not what happened. We basically use computers as word processors, with maybe some custom Ruby scripts or Excel magic to get something done, but it's not computational. And we recently got a couple of great surveys of computational and computerized methods: for every single subtask, every single step in the general workflow of historical linguistics, there actually exist very good computational and computerized methods. So what do we need? Let's say the technology is already here, and what is needed is engineering. You need a context where you actually want to do something, and you actually want to involve computers in it, and you see what you can create. That's the basic idea.

Okay, so let's talk about some previous approaches, from an engineering point of view — which is to say: how can they be made to produce concrete results of good quality? Let's focus on one of the most important tasks in comparative historical linguistics, that of cognate assignment, or cognate determination, or cognacy, or whatever you call it. You have a bunch of words in different languages, and you try to partition them into cognate sets: this word is cognate with that word and that word, so you put them all together and call it a cognate set. I call it a cogset, but it's the same thing. There are two basic computational approaches to cognate assignment. The first is automatic cognate detection: an algorithm, quantitative and more or less language-agnostic; you give it an input, the word lists, and it tells you which word is cognate with which. The second idea is what I call back-projection etymology. You already know the phonological history of the language group pretty well, and you encode it into — let's call them programs — that project every attested daughter-language form back into candidate proto-language forms, and then you criss-cross, you triangulate the results together, and see what can be fitted into cognacy with what. Automatic cognate detection has a very long history. The earliest approaches are actually pre-computer: you devise a method for three or four research assistants — people still had research assistants in that era — and have them crunch words like a computer, in a fixed procedure, so you can get a rough first idea of what cognates might exist.
As soon as the thing gets computerized and acquires some actual engineering constraints, the basic methods rely on raw phonetic similarity, which is a kind of no-no for historical linguists, because that is basically saying: Persian bad means 'bad', so Persian bad is cognate with English bad. Of course the world doesn't work like this; those algorithms pick up such coincidental look-alikes while failing on non-obvious cognates. What we actually use here is the method developed by Johann-Mattis List, LexStat, which uses pairwise sequence alignment to closely mimic the behaviour of the comparative method. Basically, it assumes that words which might be cognate actually are cognate, computes a correspondence score from the word list itself, and uses that score to determine cognacy. This self-referencing is very robust, and most importantly it fixes the over-reliance on surface sound values; it behaves robustly and independently of the specific languages. So the algorithms, especially the newer ones like LexStat, are great: run one on some German–English–Dutch–French kind of dataset and it gives you 94, 95% accuracy. But it cannot function as a whole workflow by itself. The question is what happens afterwards: you get a bunch of cogsets — how are you supposed to correct them?

This problem has two dimensions. The first is, of course, our goal of exploring new language groups. When you develop such an algorithm, you assume there are trained specialists with the magical power to tell true cognates from false ones. But this magical power is downstream of a very, very mature tradition of historical linguistics. If we don't think carefully about this, we end up having computers help people do things they can already do, while failing to help them do the things they can't. The second dimension is what I will call the Chinese typewriter problem: even with linguistic expertise, it's quite non-trivial to correct cognate judgments. Here is a fairly logical inventory of cognate-judgment errors: over-lumping, over-splitting, and misattribution. (I tried to frame these as over-correction, or type 1 and type 2 errors, but it doesn't map cleanly.) Over-lumping is when the ground truth has, say, three cognate sets, but the algorithm tells you they all belong to one cogset; over-splitting and misattribution are the converse and the mixture. For over-lumping you need to split one cogset into several; for over-splitting you need to merge several cogsets together; for misattribution you need to move forms between cogsets (see the sketch after this paragraph). That's the problem. With a small example you can see at a glance: these two blue things go here, that red thing goes there. Human beings don't have infinite memory, but if things are physically, visually adjacent, juxtaposed, the pattern is easy to see. The problem is that if you run the LexStat algorithm, or any algorithm, on real data, you get a huge pile of cogsets, and it is simply not feasible to reassign forms across 3,000 or 10,000 or more cogsets.
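To make the three error types concrete, here is a minimal sketch of the three repair operations on cogsets, assuming a toy representation where each cogset is a set of (language, form) pairs under an arbitrary id. All names here are illustrative, not CAPR's actual API.

```python
# Toy cogset store: id -> set of (language, form) pairs.
cogsets = {
    "c1": {("lang_a", "ba"), ("lang_b", "pa"), ("lang_b", "wa")},
    "c2": {("lang_a", "mi")},
}

def split(cogsets, src, members, new_id):
    """Fix over-lumping: carve `members` out of cogset `src` into a new cogset."""
    cogsets[new_id] = set(members)
    cogsets[src] -= set(members)

def merge(cogsets, ids, new_id):
    """Fix over-splitting: fuse several cogsets into one."""
    cogsets[new_id] = set().union(*(cogsets.pop(i) for i in ids))

def move(cogsets, word, src, dst):
    """Fix misattribution: move a single word between cogsets."""
    cogsets[src].discard(word)
    cogsets[dst].add(word)

split(cogsets, "c1", [("lang_b", "wa")], "c3")   # undo one over-lumping
```

The operations themselves are trivial; the whole difficulty, as the talk goes on to argue, is knowing *which* of thousands of cogsets to apply them to.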
Here's a perfect analogy. I took this picture at one of the libraries in Heidelberg. This marvellous machine is a Chinese typewriter. You have a mechanical cursor; you locate the cursor over one of the characters, press the button, and it lifts that type to the paper to print the character. Now imagine this, but every type is a cogset containing a big bunch of words, and you need to match this cogset here with that cogset there, while looking at all of them at once. No — it simply doesn't work. So, as a conclusion: automatic cognate detection is great, it generates excellent first drafts of cognate assignments, but there is no obvious way for a human linguist to correct the result.

Okay, let's talk about the second idea, which is crucial to the development of the CAPR methodology: what I call back-projection etymological dictionaries, or for short, back-projection etymology. What's the idea? It goes back to Hewson, who — I think it was in the 70s — had the brilliant idea of producing a computer-generated etymological dictionary of the Algonquian languages. Here's a sample entry from that dictionary: you have the Proto-Algonquian word with Cree, Fox, Menominee, Ojibwe. How can you compile such a dictionary with a computer? Well, Algonquian is one of the language groups with an extremely developed tradition of historical linguistics; back in the 20s or 30s people already understood the sound changes very well, even for quite difficult languages like Menominee. And since you know the historical phonology almost perfectly, you can write programs that back-project. What's back-projection? It's a function that takes an attested daughter-language form and produces the list of possible proto-language forms. Given that, you can have the computer go through all the daughter-language dictionaries and unite the entries, checking whether this form can be reconciled with that one in terms of a proto-language form. From an engineering point of view, this has the unique distinction of actually working — there's the dictionary. And why it works is also easy to explain: of the three kinds of cognate-judgment errors, you basically only need to get rid of homophones, which is over-lumping; it doesn't require matching things together across a 3,000- or 30,000-item set. So the problem isn't what happens afterwards, because you only need to split cogsets that are already made. The problem lies before: it has a fragile dependency on a perfect historical phonology. Let me put that into a somewhat exaggerated example. You take a group, you write some rules which are quite wrong, the computer compares everything, and it gives you 15 etyma. There you have your great etymological dictionary of group X with 15 etyma, and you have absolutely no idea what went wrong, because when it fails, it fails silently. You just have no idea. So it works only in the very limited context of scaling up an already reliable theory of historical phonology into an etymological dictionary.
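Hewson-style triangulation can be sketched very compactly. This is a minimal sketch, not Hewson's actual program: it assumes each daughter language comes with a back-projection function from an attested form to the set of proto-forms it could regularly descend from; the function names and the proto-forms are hypothetical placeholders.

```python
from functools import reduce

def back_project_lang_a(form):    # hypothetical stand-ins for the real,
    return {"*pe:", "*pi:"}       # per-language back-projection programs

def back_project_lang_b(form):
    return {"*pe:"}

def candidate_etyma(entries):
    """entries: list of (back_projector, attested_form) pairs.
    A set of forms is reconcilable iff their back-projections
    share at least one proto-form."""
    projections = [bp(form) for bp, form in entries]
    return reduce(set.intersection, projections)

print(candidate_etyma([(back_project_lang_a, "pi:w"),
                       (back_project_lang_b, "pe:wa")]))
# -> {'*pe:'}; an empty set means the forms cannot be reconciled
```

The fragility described above is visible in the code: if one rule in one back-projector is wrong, the intersection silently comes out empty, and the etymon simply never appears.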
But with CAPR we are trying to do something else, this exploratory work: when you have a group you don't really know much about, what can you do with it? So, while for the algorithms the challenge lies in finding a way to correct the results, here the challenge lies in giving linguists a way to discover and gradually refine the historical phonology. That's the basic situation. If we look at both approaches from this engineering, workflow point of view, they share a fatal disadvantage: they are non-exploratory and non-iterative. You need perfect knowledge of the group you're working on before you can enlist the computer's help — and of course, if you know the group perfectly, why do you need a computer?

Okay, so let's distill this discussion into design goals. To have something we would actually want to use, the first goal is what I call optimal human–computer cooperation. The ideal is a computer that works like a very intelligent assistant — more like a student, with you as the director: the student does everything to the best of their ability, but then asks you, "how should I handle this?" You do the interesting bits, generating hypotheses and giving intuitive judgments, and the computer does the work of enforcing consistency, keeping track of details, and curating the dictionaries. That's the ideal. The second design goal is incremental improvement. As we saw with back-projection etymology, it really needs a good hypothesis of historical phonology, which in most cases we don't have. A good system should let hypotheses be gradually built and refined: even a crude set of hypotheses should reward the linguist with more and better information, so that the linguist can use the information gained from the bad hypothesis to make a good one. And the third goal is non-destructiveness: you need to think really hard so that whatever the human linguist does is never thrown away, except when the linguists themselves decide it's wrong.

Okay, so let's talk about our workflow, this thing called CAPR: computer-assisted proto-language reconstruction. The context is that, within the framework of the ERC project Beyond Boundaries, we want to produce an etymological dictionary of the Burmish languages. We have a lot of digitized data, and right now we're working on finding a way to gracefully produce a first version of the dictionary from, I think, six languages in a parallel-lexicon kind of source. The context is that of monosyllabic languages. Some people think "monosyllabic languages" means languages where most or all words are monosyllabic, which don't really exist, I think. But it's still a very important typological type: the morphology is mostly compositional, the productive morpheme boundaries coincide with syllable boundaries, and little happens across syllables, either diachronically or synchronically.
With monosyllabic languages, basically, we can forget about words and take the syllable as the basic unit of phonological evolution and etymologization. Here's an example of what cognacy looks like in monosyllabic languages. For example, 'brain' in Maru is literally 'head' + 'brain', and 'head hair' is 'head' + 'hair' — and this 'hair' is the same morpheme as in 'body hair'. So the 'head' syllable of 'brain' goes into one cogset together with the 'head' of 'head hair', while the second syllable gets etymologized on its own, in another cogset. That's the basic idea with monosyllabic languages, and it saves us a lot of work. CAPR is developed for monosyllabic languages, with that limitation; if we later want to make it useful for other kinds of languages, we'll need to think more about it.

The actual document the CAPR workflow lets us work on is what I call the bipartite hypothesis, which is made of two parts: the cogsets and the phonological history. The cogsets are the cogsets: a word in the general case, or a syllable in our monosyllabic case, belongs to a cogset. Nothing fancy here. The other part is what we call the phonological history. You write it as finite-state transducers in a formal language. I don't have time today to talk about how great transducers are, but the idea is that they let you write the historical evolution as SPE-style rules, which everybody more or less knows, and they automatically permit you to project both forward and backward. That's exactly what we need. The phonological history itself is composed of two parts. Let's look at this example, where Proto-Burmish *bar gives Maru pɔ. First you need to define what a legal syllable in the proto-language is. Here, a legal syllable is composed of an initial and a rhyme; the initials are these, and the rhymes are these. You see that *bar fits: b fits in the initial slot and ar in the rhyme slot. Nothing fancy. Then we define the individual sound changes that lead from the proto-language to Maru: for example, here's a devoicing rule that turns b into p, and here's a rule that turns ar into ɔ, so we actually get pɔ from *bar. Finally, we have a transducer called Maru, which defines not just any legal Maru syllable, but a perfectly inherited Maru syllable — one regularly derived from Proto-Burmish, excluding loans, analogy, and other shenanigans. How do you define it? You take the proto-language phonotactics and apply the sound changes one by one: first the devoicing, then the rhyme change, and so on. So what you have is essentially an ordered list of sound changes applied to proto-language syllables (a minimal sketch follows below). An interesting point about CAPR's view of the hypothesis — of the theory you actually work on — is that you never write down any proto-forms. Looking at this, you might think the linguist has to write a proto-form like *bar for each etymon by hand, but it shouldn't work like that: the proto-forms should not be provided by the human linguist; they are inferred from the phonological history and the daughter-language forms.
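Here is a toy model of that pipeline — proto-phonotactics, then ordered SPE-style changes, then forward and backward projection. It's a minimal sketch under invented segments (the inventories and rules are not the real Proto-Burmish ones), and it uses plain string rewriting where real CAPR compiles the rules into finite-state transducers; backward projection is done by generate-and-test over the legal proto syllables, which is one simple way to get bidirectionality.

```python
import itertools, re

INITIALS = ["b", "p", "m"]                    # toy proto inventory
RHYMES   = ["ar", "a", "i"]
PROTO    = {i + r for i, r in itertools.product(INITIALS, RHYMES)}

# Ordered SPE-style changes from the proto-language to (toy) Maru:
RULES = [
    (r"^b", "p"),     # devoicing: b > p
    (r"ar$", "ɔ"),    # ar > ɔ, so *bar > pɔ
]

def forward(proto_form):
    """Apply the ordered sound changes to one proto syllable."""
    for pattern, repl in RULES:
        proto_form = re.sub(pattern, repl, proto_form)
    return proto_form

def backward(attested):
    """Back-projection by generate-and-test: which legal proto
    syllables yield this attested form under the ordered rules?"""
    return {p for p in PROTO if forward(p) == attested}

print(forward("bar"))     # -> 'pɔ'
print(backward("pɔ"))     # -> {'bar', 'par'}: the inherent ambiguity
```

The ambiguity in the last line is the point: a daughter form back-projects to a *list* of candidate proto-forms, and it's the triangulation across languages that narrows it down.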
So we get, for example, this cogset here meaning 'tears'. This word reconstructs to these different Proto-Burmish syllables, that word reconstructs to those, and this one to all of these. Then you do a sort of set-theory operation — in fact it doesn't always work, so there are some complex heuristics — to tell that the actual reconstruction of the whole cogset is likely to be this one (a sketch of the inference follows below). Okay, that's how it's supposed to work. Why do we do this? Because it permits rapid iteration on the phonological history. If you do any kind of reconstruction by hand, one of the not very thorny but very annoying problems is this: if you change your hypothesis — say you change your proto-language phonotactics — then every such change means a lot of the proto-language forms need to be replaced. So basically you get lazy and never update your reconstructions, except when you have a completely new vision of how things are supposed to work. That is quite counterproductive, because, as we will see later, the best way to build a good reconstruction, a good hypothesis, is to iterate very quickly on the whole theory.

So we have transducers, we have the phonological history — what are they used for? The first insight of the CAPR methodology is that you just blindly do backward projection wherever possible, and very often forward projection too, and that provides a lot of context to the linguist. For example, take this word here, 'cloudy'. The inferred reconstruction is this form. For this syllable, the computer runs the transducer backwards to do a projection from the attested form, and sees that its reconstruction is harmonious: it conforms to the projected reconstruction of the whole cogset, so you get a check mark. And you get different kinds of crosses. Here's one kind: this form, at least according to our current hypothesis, can only come from a different proto-form. So the computer projects the word back and tells you: it does have a reconstruction, but not one that conforms to the projected reconstruction of the cogset. And here's another example, where a form can come from two different proto-forms. Basically it tells you: maybe your phonological history is wrong here — maybe this is a regular reflex after all. Whereas in the other case, what's wrong is not the phonological theory; you should move the word out and create another cogset. Okay, so that's the basic idea of back-projecting everywhere, which gives you a lot of context — we'll take another look at it later. The basic organization of the CAPR workflow is what you would expect: you process the source word list into a computer-readable form; you have a bootstrapping stage where the LexStat algorithm generates the first draft of the cognate assignment; and then you have iterative improvement, before deciding that you're going to publish the result as an etymological dictionary. The interesting part is the iterative improvement.
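Returning to the "set-theory operation" mentioned above: a minimal sketch of inferring a cogset's reconstruction from its members' back-projections. The talk says plain intersection doesn't always work and CAPR uses more complex heuristics; this sketch just falls back, as one illustrative heuristic, to the candidates shared by the most member words when the intersection is empty. The forms are made up.

```python
from collections import Counter
from functools import reduce

def infer_reconstruction(candidate_sets):
    """candidate_sets: one set of proto candidates per member word."""
    common = reduce(set.intersection, candidate_sets)
    if common:
        return common                       # the clean case
    # Fallback heuristic: candidates supported by the most members.
    votes = Counter(p for s in candidate_sets for p in s)
    top = max(votes.values())
    return {p for p, n in votes.items() if n == top}

print(infer_reconstruction([{"*be", "*bi"}, {"*be"}, {"*be", "*ba"}]))
# -> {'*be'}
```

Because nothing here is hand-entered, changing the transducers and re-running this inference instantly refreshes every reconstruction — which is exactly what makes the rapid iteration possible.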
The idea is that, since we have this bipartite hypothesis, when you make a better transducer it should somehow contribute to the cogset-editing process and help create better cogsets, and vice versa. So you work on transducers, then you work on cogsets with the new information and feel very happy, then you use that to create better transducers and feel very happy, and so on, better and better, until you reach the final result. So let's talk about this iterative improvement, for both parts. The more interesting part is that of cognate assignment: how are you going to solve those seemingly intractable problems of creating cogsets? Here is a misattribution error. Once I lay it out like this, then of course you can see how to correct it: drag things around. And with back-projection, you get something very nice — let's take a look.

Here's a video. We have this 'cloudy' cogset — maybe I shouldn't have put it in full screen — let's start from the beginning. We just looked at this 'cloudy' cogset, and maybe something doesn't really belong here. So we work on it: we create a new cogset, and it's as easy as dragging the word from the old cogset to the new one. And look at how the reconstruction actually changes. We drag this form here first; since it has no reconstruction according to our current theory, there's no reconstruction for the new cogset yet. Then we drag the other form over. This one does have a reconstruction, with two candidate proto-forms, so it is inferred that the new cogset likely has one of those as its reconstruction. That's the basic idea: you let linguists drag words around, and you use the transducers to provide the most useful kind of context that can be provided to a linguist. Of course, on the demo board we had just three cogsets; in reality we have a lot of them. That's a problem that needs to be taken very seriously. Here's what we want. First, partial visibility: you shouldn't let yourself be overwhelmed by an ocean of cogsets — that wouldn't work. Either you prioritize them, because some are more important or more reliable, or will end up in the smaller etymological lexicon — or, failing that, just give me some random cogsets, so that I don't have to work on thousands of them at once. The second idea is grouping. Since over-splitting and misattribution are relations between cogsets, we really need the related cogsets to be visible to the human eye on the same screen, so that a human being can realize: maybe we should drag this from here to there, or merge these two cogsets together. So we need a way to group together the cogsets that are likely to stand in an over-splitting or misattribution relation. Okay, here's what we do. First, we have a criterion called reliable cogsets. The idea is that, since in an etymological dictionary an etymon is usually justified by more than one example,
it does us no harm to declare that we only want to see cogsets that already have a reconstruction justified by words from more than one language. The second idea is that of boards. You take a reliable cogset and you look for other cogsets — reliable or unreliable — that can have the same reconstruction. You make an equivalence class: you partition the cogsets into boards, where the cogsets on a board share one or more candidate reconstructions (a sketch follows below). What does that do? By doing this with the transducers, only a small portion of the cogsets is visible at a time. And since they're visible in boards, every board contains cogsets that are very likely to stand in an over-splitting or misattribution relation — some of their words already share a reconstruction, so they may be very closely related. Boards also offer a task-management benefit: they chunk the work up into manageable bits.

Let's take a look at some lovely boards. Here's a board with a fair number of cogsets, but not too many. Here's another board; notice, for example, this cogset with its reconstruction, and this other one, and some of the glosses are 'all' or 'whole', so they should be merged together. Here's a rather large cogset; here's a rather large board where we can merge things together. Here's a very good board which contains basically only one etymon — well, there is one word that doesn't fit: this one is just a word for 'water', but the algorithm failed to see that it isn't the same as the other forms, so let's move it out. And here we get a word whose reconstruction means 'to go to battle'; here you get the same meaning, and the reconstruction — even the superficial form — lets you make the decision: 'to fight' is of course an idea related to 'to go to battle', so you can merge all those cogsets together. And let's look at this one — I'm not really working here, this is just the result of the first fishing, so there isn't a lot on it yet. Okay, that's the basic idea.

This process of using the transducers to put cogsets onto visible boards we call fishing. Why fishing? Because with every improvement to the phonological history, to the transducers, two things happen that make more reliable cogsets visible to us: if you write a first rough transducer for a new language, or improve the transducer of a language that already has one, then either more cogsets become reliable, or some small cogsets acquire a reconstruction and can be attached to a reliable cogset's board. That's the basic idea. We have a nice little fish icon to indicate a cogset that has just been fished up. So, here's how fishing works, colour-coded: this grey box represents the results of the LexStat algorithm, a huge pile of candidate cogsets waiting to be corrected. Then — I'm being very hand-wavy here — you bootstrap yourself and put together a rough first draft of the historical phonology. And with fishing, you fish a lot of cogsets out into this red portion, which I call visible, or boarded.
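The reliability criterion and the boards can be sketched as follows. This is a simplified illustration, not CAPR's code: boards are keyed here by a shared candidate reconstruction, and a full implementation would additionally merge boards that share a cogset into one equivalence class, as the talk describes. The `reconstruct` parameter stands for the inference over the current transducers.

```python
def is_reliable(cogset, reconstruct):
    """cogset: iterable of (language, form) pairs.
    Reliable = has a candidate reconstruction supported by
    words from more than one language."""
    candidates = reconstruct(cogset)
    languages = {lang for lang, _ in cogset}
    return bool(candidates) and len(languages) > 1

def make_boards(cogsets, reconstruct):
    """Group cogsets that share a candidate reconstruction."""
    boards = {}          # candidate proto-form -> list of cogset ids
    for cid, cogset in cogsets.items():
        for proto in reconstruct(cogset):
            boards.setdefault(proto, []).append(cid)
    # Keep only boards anchored by at least one reliable cogset.
    return {proto: ids for proto, ids in boards.items()
            if any(is_reliable(cogsets[c], reconstruct) for c in ids)}
```

Everything a board gathers is, by construction, a candidate for merging or for moving forms — which is why showing one board at a time is enough context for the linguist.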
Then, with the cognate-assignment interface — the one we just saw a lot of videos of — the human linguist edits the first batch of boarded cogsets into human-curated, high-quality cogsets. From those you get better hypotheses of historical phonology, which fish out more cogsets, and you edit those again in the interface, and so on (the loop is sketched in code after this passage). With this fishing process you finally have an incremental way out of the conundrum of too many cogsets.

And here's an interesting thing I want to mention: the rejects. Because your first transducers aren't perfect, they will pick up some things which, under later transducers, the fishing algorithm decides shouldn't be on the board after all. Those get assigned a cute dead-fish icon. And if you take an actual look at those dead-fish items, it's very interesting. For example, this one means 'tile' — the tiles you put on the rooftops of houses. The first thing you notice is that the forms look a little too similar to be actual cognates: given the correspondences, this one should be different, or that one should be, but no, they are all va, va, va. The second thing is that they receive quite disparate reconstructions: this one with one tone, this one with another tone, and this one has no reconstruction at all — it isn't an inherited-looking syllable. What happened? Of course, if you speak Chinese, you know: this is basically the Chinese word for 'tile', wǎ. So you see that using the algorithms and the transducers together also weeds out the loans, if what you're interested in is the inherited part of the lexicon. Okay, so here's the idea: we're building an iterative workflow which lets you work on either of the two parts of the bipartite hypothesis. When you get better transducers, they help you fish cogsets; and the transducers also contribute directly to the UI, so you're helped — spoiled, even — by the transducers into creating better cogsets, which in turn can be used to create better transducers.

Some discussion. This interface and this fishing workflow basically combine the two approaches previously discussed, automatic cognate detection and back-projection etymology. If you do maths or physics, you know the idea of one object degrading into another in a limiting case. If you imagine the cognate-detection algorithm saying that no word is cognate with any other word, the workflow basically degrades into back-projection etymology; so it subsumes both, and of course the combination is the actually interesting thing. It combines the advantages of both methods, especially with regard to the scope of the data. Automatic cognate-detection algorithms are not really fazed by small imperfections, but on the other hand they miss the non-obvious cognates — the kind that give you an "aha". I remember learning, I think only last year — it's really ridiculous that I hadn't seen it earlier — that a familiar word has a non-obvious cognate: completely natural once you know the historical phonology, but it wouldn't click otherwise, and it wouldn't click for the algorithm either.
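Looping back to the fishing workflow described at the top of this passage, here is a pseudocode-level sketch of the whole loop. It reuses `make_boards` from the earlier sketch; `edit_boards` stands for the human curation step and `refine` for rebuilding the transducers from curated cogsets — both are placeholders, since those steps are interactive in CAPR rather than functions.

```python
def fishing_loop(raw_cogsets, reconstruct, edit_boards, refine, rounds=10):
    """raw_cogsets: the LexStat first draft (the grey box).
    reconstruct: inference under the current transducers.
    Each round fishes cogsets onto boards, lets the linguist edit
    them, and feeds the edits back into better transducers."""
    curated = {}
    for _ in range(rounds):
        boards = make_boards(raw_cogsets, reconstruct)   # fish
        edits = edit_boards(boards)                      # linguist drags words
        if not edits:
            break                                        # nothing new surfaced
        curated.update(edits)
        reconstruct = refine(curated)    # better phonology -> better fishing
    return curated, reconstruct
```

The dead-fish rejects fall out of the same loop: a cogset boarded in an earlier round that `make_boards` no longer returns in a later round is flagged rather than silently dropped, in line with the non-destructiveness goal.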
Back to the comparison: back-projection etymology gives you those aha's, but if there is a little imperfection, either in the actual forms or in your current version of the historical phonology, the match just doesn't appear. So on the one hand, we don't want this over-fastidiousness making data invisible; on the other hand, we do want to know whether a modern reflex can be etymologized regularly or not. CAPR, by putting the human linguist at the centre of the approach, provides both kinds of information: the kind the algorithms give, and the kind the strict, rigorous etymological method gives. It gives you all this information and lets you decide what to do with it.

Okay, now let's talk about the transducers. The thing with transducers is this: for the cogsets, you have the algorithm creating them and you only need to correct them; but the transducers you have to create ex nihilo, which makes the situation quite frustrating. So here's what we want. First, to create your historical phonology you need something to work on: you need the words that you think may be cognate with each other, and you need a visualization mechanism that shows you patterns and makes you think about things. And since transducers — or any kind of hypothesis of historical phonology — are unmanageable, fragile juggernauts, you also need some way to decompose the work into discrete tasks of manageable size, with feedback, so you know whether you did something good or bad. Okay. So we work from correspondence patterns. The old masters of historical linguistics, of course, held that the correspondences are the real thing, and the reconstructions — Meillet calls them "restitutions" — are not real, just a kind of algebraic notation for the real correspondence patterns. It doesn't quite work like that, but it gives a good idea of their importance, and of the whole strategy. In the context we come from — the linguistics of China and Southeast Asia, and Sino-Tibetan languages more generally — there's this practice of putting actual attested forms into correspondence-pattern tables. Here's an example: the reconstruction of one proto-initial for Proto-Miao. The author tries to be exhaustive: all the words are put here, and you can see how they correspond to each other. And here's a less exhaustive example, reconstructing a proto-language from four languages with five words, but the basic idea is the same. There's also the practice of explicitly marking non-regular correspondences: here it says the tone isn't regular, there that the vowel isn't regular. Now, in the most general case, generating correspondence patterns from cogsets is a genuinely hard problem, because to build this kind of pattern you need alignment: you need to take the words and fit them together into a table where you can say that Spanish l corresponds to Italian l, which corresponds to Romanian l, which corresponds to French l, and so on.
However — first, we know that we don't really need the actual algebraic expression of the correspondence pattern; we need it mostly as a visualization tool. And I think we can all agree that "∅ corresponds to t corresponds to p corresponds to ∅", on its own, completely misses the point. But if you use the pattern as a kind of index to generate those complete, China-and-Southeast-Asia-style tables with examples, it makes the pattern extremely clear. Even if the index itself is "∅ : t : p : ∅", which is completely opaque as a line, we look at the table and get a lot of ideas about where the pattern might come from. So the exact alignment is not that important. And, at least in the current incarnation of the CAPR methodology, we're working with monosyllabic languages, so we do a very trivial thing: alignment by phonotactics. This is like phonology circa 1000 AD, the Chinese rhyme-table tradition: you just divide the syllable into initial, medial, nucleus, coda, and tone. For every language, the linguist writes a parser that parses forms into this schema (a sketch follows below), and the schema also has combined fields: the onset is initial plus medial, and the rhyme is medial plus nucleus plus coda; and then there's the tone. The important point is that you don't treat alignment as some kind of very refined analysis, but as a tool to help you see the actual patterns.

Okay, here's how it works. We have these languages, and we're going to study Atsi. The interface is very primitive — it's me doing the programming, so that's what you get. In the correspondence view you wait a bit to get the report, and it gives you chapter titles — initials, rhymes — and for each pattern you get the attested forms everywhere, plus the back-projections, across the languages. We still need to wait a little to get the actual results... here they are. Let's look at the rhymes again. There is basically no interface, so you just search around in this document. Okay — here is a tone correspondence. That's the basic idea. And with back-projection — here's a really great example of how back-projection helps you reconstruct. We see this correspondence pattern: ʑ in Achang Longchuan and Xiandao corresponds to v in Atsi and r in Maru — for example in 'chicken', 'cock', 'hen', which are the same thing and should be merged. But we have basically two examples: for one word the forms begin with ʑ, v, r, and for the other likewise. When you just look at ʑ against v against r, it's really difficult to think of any hypothesis that explains them directly. But of course, you don't need to. The idea is to enable gradual refinement: you create hypotheses for the largest, most regular patterns first. And what does that give you? After you've created them, you look here: now it's very easy to see a grouping in which Achang and Xiandao, with their ʑ, back-project to a proto-language *r, while Atsi and Maru also point to proto-language *r.
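Alignment by phonotactics is simple enough to sketch directly. This is a minimal illustration, assuming toy inventories rather than those of a real Burmish language: a per-language parser splits each syllable into initial, medial, nucleus, coda, and tone, and derived fields like the rhyme are concatenations of slots.

```python
import re

# Toy phonotactic schema for one language; real inventories differ.
SYLLABLE = re.compile(
    r"^(?P<initial>ph|th|kh|[pbtdkgmnŋszʑrlwvj]?)"   # aspirates first
    r"(?P<medial>[jw]?)"
    r"(?P<nucleus>[aeiouɔə])"
    r"(?P<coda>[pmtnkŋʔ]?)"
    r"(?P<tone>[0-9]?)$"
)

def parse(syllable):
    """Split one syllable into the five phonotactic slots."""
    m = SYLLABLE.match(syllable)
    if m is None:
        raise ValueError(f"unparsable syllable: {syllable}")
    return m.groupdict()

row = parse("kjap3")
row["rhyme"] = row["medial"] + row["nucleus"] + row["coda"]   # derived field
print(row)   # this row becomes one column of the correspondence table
```

Lining up the same-named slots across languages is the whole "alignment": crude, but, as argued above, good enough to make the patterns visible.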
Back to the example: for those two words, and probably others, you get ʑ in some languages and r in some other languages, and you can build a more sophisticated hypothesis from there. This is marvellous, because in the real world what happens is: someone reconstructs a proto-language badly, you read their article, and you write another article saying, how could you be so stupid as to miss this obvious correspondence pattern? But of course, when your whole mind is taken up by the very basic, laborious work of collecting correspondence patterns, you can't really see it. Here, you can evolve the hypothesis in a very effortless manner.

Okay, the next problem is debugging. You have ordered sound changes, you have feeding and bleeding and so on; as the Chinese saying goes, you pull one hair and the whole body moves. This would be very difficult in itself, so the idea is to factor every change into tiny atomic changes, examined comparatively. I think it's clearest with an example, so let's continue the video. We're still studying Atsi, Bola, and Maru, and now I'm looking for those cross marks. A cross mark means, for example, that this Maru word cannot come from any Proto-Burmish etymon according to the present theory of the historical phonology of Maru. So I look for crosses — crosses, crosses, crosses — and here's a good one. If you look at it, there are two crosses here: a whole correspondence pattern for which the current theory doesn't work for any of the forms. For 'far' and for 'rely', the actual attested Maru form doesn't correspond to anything; for 'to measure', it corresponds to something, but look at every other language — the correspondence points elsewhere; it's impossible. So what's the problem here? You have this *e vowel in Proto-Burmish, which gives e in Atsi and e in Bola — the same vowel everywhere in Atsi and Bola. However, in Maru, sometimes it's e and sometimes it's a. Earlier we had simply not seen that some of these *e vowels could become a. And what's the conditioning? Here it is — the change from the Proto-Burmish form to the Maru form could be quite recent — the idea is that the Proto-Burmish vowel *e changes into a after velars, including the labiovelar approximant w. Okay, now we've found our problem. For demonstration purposes, I start by inventing a very bad solution. Here it is: first, a sound law that changes e into a mechanically, without taking the phonetic conditioning into account. Also, the placement of the sound law in the rule ordering is wrong: e becomes a here, but there is an already existing sound change in Maru where a changes into o. So we can predict that our bad version turns e into a, which then gets turned into o — which is not what we want. And what do we get? For the words we're actually interested in, we get nothing: the words that had no reconstruction before, like va 'far', still have none.
The only difference is that before, the theory gave no reconstruction, and now it gives a wrong one. And here we see the beauty of the CAPR interface. You have two rows: the first row corresponds to the old transducer and the second to the new one, and it checks whether your new transducer explains the attested forms better or worse than the old one. In this case it's worse — and since it's a systematic degradation of the quality of the hypothesis, you get a huge swath of red with frowning faces across the whole correspondence pattern, which is CAPR's way of telling you that something went terribly wrong. So now we're reminded of Historical Linguistics 101 — sound laws are ordered, and orderings have consequences — and we put the rule in a better place. Now you get smiley faces telling us that the attested form va is properly reconstructed to its Proto-Burmish etymon; but of course you still get the huge swath of red, sad stuff. Okay, so now let's do it seriously. We remember the conditioning: it happens after velars. What are the velars? k and g — and we remember there's this w, and maybe aspiration. So: after velars. Wonderful. And... what happened? It did the right thing for 'far' and 'to measure', but not for 'rely'. Why? Because of course 'rely' has the velar nasal ŋ — I meant to get this wrong at the start for demonstration purposes, but now the interface has caught me in an actual error. So here's the right version of the sound law: k, g, ŋ — we finally got the ŋ. And I should feel very happy about this kind of thing, because when a whole correspondence pattern turns to smiley faces and green, it means a new situation, a new sound change, is now properly accounted for where it wasn't before. However, we still need to check that it doesn't make anything else worse. So, to be safe, I type the frowning face into the search box, and the search tells me there is no frowning face anywhere in the document. Okay — so I'm very happy with my new sound law in Maru, where e changes into a after velars.

Okay, so, discussion. For the transducer-debugging part: first, you get this visualization, the China-and-Southeast-Asia-style correspondence charts together with forward and backward projections, which provides the necessary and minimal information for generating the correct hypotheses. The next thing is incrementality. It's very dangerous to change transducers, so we have atomic changes, and the diff report with smiling and frowning faces — a minute-scale loop which ensures that if you're making a trade-off, you know you're making a trade-off, and if something is an improvement, you know it's an improvement (a sketch of such a diff appears below). And the third thing is iterativity. What do the cogsets do for this part? When you have better cogsets, you get more examples. I think the reason we didn't catch this change earlier is that there were too few cogsets in the first round of fishing: when you have only one example, you're never really sure whether something needs to be done about it.
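The smiley/frowny diff can be sketched as follows — a minimal illustration, not CAPR's code: for each attested form, check whether its back-projection under a transducer still reaches the cogset's accepted reconstruction, and compare the old transducer against the new one. The `backward` functions are of the same shape as in the earlier transducer sketch, and the toy forms reuse that example.

```python
def status(backward, form, reconstruction):
    """Does this transducer derive the form from the accepted etymon?"""
    return reconstruction in backward(form)

def diff_report(old_backward, new_backward, judgments):
    """judgments: list of (attested_form, accepted_reconstruction)."""
    for form, proto in judgments:
        before = status(old_backward, form, proto)
        after = status(new_backward, form, proto)
        if after and not before:
            print(f"{form}: ☺ newly explained by {proto}")
        elif before and not after:
            print(f"{form}: ☹ regression, no longer yields {proto}")

old = lambda form: {"bar"} if form == "pɔ" else set()
new = lambda form: set()          # a regression introduced by a bad rule
diff_report(old, new, [("pɔ", "bar")])   # -> pɔ: ☹ regression ...
```

Because every atomic change to the rules produces such a report within a minute, a trade-off can never sneak in unnoticed.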
And also — still on iterativity — when the cogsets are properly checked, when the rubbish has been taken out of the algorithm-generated cogsets, you get fewer red herrings and a much, much clearer picture of the history. So that's basically it: we just covered transducer debugging as part of the iterative workflow. When you get better cogsets, they generate good correspondence patterns for you, which you use to debug your transducers and improve your hypothesis, in an iterative, instant-feedback way. That's basically it. Sorry for this being a bit technical, but I think if we're really going to talk about making computers actually useful for historical linguistics, we need to see, from beginning to end, how this workflow can be made consistent with itself and iterative.

Okay. So, future directions. We're still working on the current incarnation of the CAPR methodology, with a view to producing a really good etymological dictionary of the Burmish languages — and, frankly, I don't program that fast. Putting it more seriously: a lot of things are greatly simplified by the fact that we're working on monosyllabic languages. First, the alignment problem, which is genuinely hard, is simplified away. To address it properly, I think we'll at some point need much better transducer engines than the current ones — it's like the difference between jury-rigged regular expressions and real regular expressions. As long as we keep the bidirectionality, I think we should explore ways to put more machinery into them. Also, with monosyllabic languages we don't really have to care about paradigmatic morphology or analogy, which destroys the neat relationship between the inflected form in the daughter language and the inflected form in the proto-language. Even encoding such things in a rigorous, machine-readable, machine-verifiable way is an open problem.

Let me end this talk by saying some good words about transducers. I think at some point, this and other technologies can bring us to something that in French we call démocratiser — in English it's more like "de-gatekeep". Indo-European linguistics is an actual discipline with many people working on it, so people can talk in the laziest, most symbolic way possible, and it's still workable. But for minority languages, let's be frank: we are never going to get three hundred professors vying with each other over one tiny language family in the jungles of northern Laos. It doesn't work like that. What we can do is make our hypotheses transparent, easy to change, and easy to understand, so that a less institutionalized, less intense kind of cooperation can still move us towards a more equitable future for the historical linguistics of the world's languages. Okay, that's it. Thank you very much. — Thank you; a very nice presentation.
We did get a couple of questions in the chat, and since we have Nathan Hill here with us, he was able to respond to some of those topics. One was about whether you'll be able to move on to polysyllabic languages with this, which you just addressed a bit. The other: what is the usability of this code for people who don't have a background in coding or computer programming? How accessible is this for historical linguists who want to apply the technology but are basically starting at ground zero when it comes to coding?

Okay, let me start with the polysyllabic problem. For polysyllabic languages — for non-Chinese-ish languages — there are things that are more difficult, but there are also things that are simpler. For example, a large problem in the actual Burmish work is that almost every word is bimorphemic or quadrimorphemic, so the juggling of these parts makes things much more difficult. And I think that's the reason why LexStat, which is supposed to work extremely well, does not work as well in a bimorphemic setting. So, first: if you take our favourite example, Polynesian, then as long as something can be done about alignment that requires less human intervention, Polynesian is in many senses much easier to treat than the monosyllabic China-and-Southeast-Asia kind of languages. That's the first point. The second point is: yes, it is genuinely difficult, and after getting the current system somewhere, I'll be thinking about it quite actively. What was the second question?

— It was about usability for people without coding experience: are you making the computers useful for people who don't know how to code?

Okay. Let's say the good thing about transducers is that writing transducers is much, much easier than actually coding. Anyone who has had a classical generative-phonology course knows the notation — that's the first thing. The second thing is the interactivity. Let me talk about myself actually programming: I know everything about programming, but I can't program very well. However, I use proof assistants quite well, and a proof assistant is the kind of tool where you write a thing and the computer checks whether it's correct and whether it fits the context, with all those little aids that help you keep track of the details — which is one of the basic ideas of CAPR. So I don't worry terribly about this. There is the other part, though: for any language group, there's real work before you get to this very clean, user-friendly version of the CAPR console — you need to do some Python. I think we'll need to create some kind of ecosystem where we can help linguists with that kind of thing.

— May I — sorry — may I just add something at a totally practical level? If people want to work on the Burmish languages, there's an extremely easy user interface that's online right now. I can't program at all, and I can use it very well.
And this is what Xun has been showing off, and it works very well. But he has had to do a lot of programming in the back end to make this possible. And, you know, we don't have a professional programmer or anything like that, so the back end is probably held together a bit with shoestring and chewing gum. If someone wanted to use this for their own language family, and not for Burmish, they would have to have some kind of programmer involved, and they would also have to do the preprocessing to make the data ready for this workflow — though they could start from the code we have. And in terms of what Xun was saying: I think, as a discipline — and this is one of the reasons we want to talk to people — we can't be the only people who make this thing, right? Because then it will only be made for us. Whereas if people can become convinced of it as a conceptual framework and as a methodology, then they can either build on the code we've built or write their own, and it would be good to have an ecosystem of people working in this way. And the last thing I would say is that Mattis List's group has produced a lot of data that would flow into this system very smoothly — they have datasets on Bai, on Hmong-Mien, on Tukanoan. So if you happen to be interested in one of those families, it would be substantially easier to get this system going. That's what I wanted to say.

— So it sounds like there's a bigger concept here: you've shown a proof of concept for one set of languages, and you'd like to see it grow into other languages. That would require setting up more datasets and more interfaces for people to interact with, to make it more usable, but this could potentially be extended if more people got together around the concept.

— I agree with that, yes. And I wish to echo Nathan's earlier remark: it's 2020, and for lots of languages you can get very high-quality free online datasets — and theoretically CAPR can help you produce very good results from those datasets.

— Great. I think we'll leave the presentation there for now. Hopefully there will be more conversations, as people contact you about the Burmish data or try to develop this for other language groups; it sounds like there's a lot to discuss. Thank you so much for taking the time to prepare this presentation and share it with us. We all learned a lot, and there's a lot of appreciation in the chat for the quality of your work and what it means to the field of historical linguistics, especially for these minority languages that have been more or less ignored in the big picture of historical linguistics. So thank you for sharing.

— Thank you very much.