It's a pleasure to welcome you to the very first departmental seminar of 2016. Our guest today is Elizabeth Eden. Elizabeth comes from a background in physics and linguistics, and she is currently doing her PhD, working on the question: can we develop an objective metric of phonological language distance? She has constructed a database and web tools for comparing segment distributions between languages, maybe between dialects, and this can be used to find loan words, maybe elicitation items for experiments, or to try to figure out a phonemic inventory from phonetic transcriptions, syllable structure and so on, which has been very interesting for us. So this is very much what Elizabeth will explain to us in more detail today.

Alright, thanks. Okay, so this talk is basically going to be an overview of this database and the website with the analysis tools on it. So if you have a tablet or something like that, you can browse this whilst I'm talking and see how you find it. The web address is up there, and we'll be coming back to it repeatedly for the rest of the talk. Okay, so first of all, I'm going to go through a bit of background as to what this tool is and why I've been developing it, then the kinds of data that we input to it, and then I'm going to go through some of the analysis tools that have been developed and what you can do with them.

Okay, so as Condi said, my PhD is about developing a metric of phonological language distance, and one of the ways in which I was approaching this was through the kind of traditional descriptive parameters of a language. So what kind of vowel contrast does this language have? What kind of syllable structure does this language have? Unfortunately, different grammars use different criteria for deciding on parameter values. They have their own personal theory of what the important phonetic criteria, or anything else, are for deciding on this. Those of you who are familiar with Erich Round's work may have heard his rant about pre-nasalised consonants and the conflicting ways of deciding this. So basically, I thought: I can't get the parameters straight from all of these different authors. It would be good to take a step back and put together the original data that they were working off. How did they come to these conclusions? And then if I disagree, I can harmonise my data and make sure the parameters are consistent. And if someone else comes along and says, actually, that's a stupid parameter, they can change it. They can see what data I'm working from and use their own criteria for deciding on parameters.

The second thing I discovered over the course of this is that it's really difficult to find lexicons. I think for people who are working on a language, it's pretty obvious where the lexicon is. But for the rest of us, they kind of hide away in people's libraries or in different databases, and some of them you only get access to if you're a member of a particular programme, and some of them you have to pay dictionary makers for. And they're all very difficult to find. So I'm trying to collate as many as possible into the same place. And then even if you don't like this particular way of displaying the data, you at least have links to where all of these lexicons come from, and you can contact the authors and get hold of the original information. So the idea is to try and help people collaborate with their data. The web tools part of this came about partially through doing the field methods course here at SOAS.
And we were trying to establish what the phonological inventory of the language we were working on was: which sounds are phonemic, which are allophonic. And what we discovered is that quite a lot of data had already been collected on Sylheti, the language we were working on. And so we wanted to say, well, do we already have examples of these or not? Now there's a tool in computing called regular expressions, which are very powerful ways of searching for different patterns. Unfortunately, because they are powerful, they're also very difficult to get right. So this is quite a famous comic from XKCD. Basically, you try and solve your problem with regular expressions. You now have an extra problem, which is that you're now working with regular expressions, and you still haven't solved your original problem. So part of the idea of this was to make some of these tools slightly more user-friendly, and make the computational tools that I had access to through writing regular expressions available to people who don't have such a strong computational background.

Okay. So basically the database takes in two different kinds of data. It takes in lexicons, and it takes in things I'm calling transcription conversions. So lexicons, fairly obviously, are minimally a list of transcribed words. And then you can also include other information for each language, but you don't have to. So you might also include, for each lexical item: some orthography for it; frequency of occurrence in a corpus, not necessarily the one you got the word from in the first place; a gloss, probably in English, because that makes comparing across other languages easier; part-of-speech tagging; and then custom tags for whatever you want. So for example, for Sylheti, one of the custom tags is Sanskrit, because that means you can then find words that have a particular origin in your later work.

So here are some examples of these on the website. We can look at the existing data and look at some word lists. So if we look at Greek, you can see that the authors of this lexicon, in fact, transcribed their data... this is interesting... there we go. Okay, is this legible, or do I need to zoom in for people at the back? You don't need to read it very well, but if you can't read any of it, then I may as well show pictures of cats. Okay, so as you can see, the people who compiled this Greek database basically transcribed their data using Greek orthography; from the way they describe it, I think it is in fact just Greek orthography; but they have a methodology for converting this into phonemes, which we'll see a bit later. And then this obviously also includes the orthography. For other examples: for English, we see that this has been transcribed using the DISC transcription system, and we also have orthography. And thanks to the SUBTLEX frequency database, we also have frequency in parts per million for these English words. And then for a language like Matbat, this is possibly more similar to what you're going to get if your lexicon is coming from field work rather than from compiling a database out of subtitles or so on. You have transcription, no frequency data, but a gloss and part-of-speech data. So these are all different ways you might input your data into the system, depending on what you have available. Okay, so the second kind of data that you have is a transcription conversion.
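Before moving on: as a rough illustration, here is a minimal Python sketch of how a comma-separated lexicon file of the kind just described might be parsed. The column order and field names here are assumptions for illustration, not the site's actual schema.

```python
# Hypothetical sketch of parsing a comma-separated lexicon file.
# Assumed layout: transcription, orthography, frequency, gloss, POS, tags...
import csv
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    transcription: str                  # required: the transcribed word
    orthography: str = ""               # optional
    frequency: float | None = None      # optional, e.g. parts per million
    gloss: str = ""                     # optional, usually English
    pos: str = ""                       # optional part-of-speech tag
    tags: list[str] = field(default_factory=list)  # custom tags, e.g. "Sanskrit"

def load_lexicon(path: str) -> list[LexicalEntry]:
    entries = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row or not row[0].strip():
                continue                # skip blank lines
            entries.append(LexicalEntry(
                transcription=row[0].strip(),
                orthography=row[1].strip() if len(row) > 1 else "",
                frequency=float(row[2]) if len(row) > 2 and row[2].strip() else None,
                gloss=row[3].strip() if len(row) > 3 else "",
                pos=row[4].strip() if len(row) > 4 else "",
                tags=[t.strip() for t in row[5:] if t.strip()],
            ))
    return entries
```

Under these assumptions, a line like `tʃɛri,cherry,12.4,cherry,N,loan` would become one entry carrying a custom "loan" tag.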
So because you can upload your data in whatever format it was transcribed, whether that's Greek orthography or your own personal idiosyncratic transcription system or the IPA, this is a way of mapping it to a single system so you can compare across languages. But you don't have to always encode your data in the same way. So the second advantage of this is that you can map your data from its detailed phonetic transcription multiple different times and compare the results of these analyses. So, for example, if we take Polish, here you can see the transcription map for Polish. Most of it is just saying they've used the IPA. But here, the Polish data was originally transcribed using a nasal archiphoneme, and in this instance we just said, okay, wherever that appears, we'll just say it's a nasal vowel. We're not interested in all these complicated theories of Polish nasals; we're just going to transcribe it as a nasal vowel. You or somebody else can come along later and say, actually, that's a pretty poor analysis of Polish, I'm going to remap this using a different system, and you can then compare what happens with the different systems. So this allows you to refine your analysis as you go along without having to re-upload your whole word list every time you change your mind. And other people can keep working on the old analysis if they prefer it, and it allows this sort of greater flexibility.

Okay. And then finally, every lexicon and every analysis is linked to at least one reference. So if you upload something to this website, you are required to specify some kind of reference, even if that's just your name and "work in progress". And where possible, I'm encouraging people to link these data to the other places they live. So, for example, the French data is linked back to the Lexique website, and so on. So people can track down where this came from in the first place and then get hold of the raw data to put into other tools if they're interested.

Okay. So that's the input data. I'm now going to basically show you the process of uploading that data. For me, a large part of the point of this talk is to try and get feedback from everybody on whether this website works. So I'm going to walk you through it, and I'd appreciate it if at the end people can give me feedback: this is intuitive, this really makes no sense, it would be easier if you added this feature, that kind of thing. So basically the workflow is: you upload your lexicon, your data, and your transcription conversion, which together form a doculect, a particular documentation of a dialect. We apply the regular expressions to get sets of sequences, like word-initial consonants, word-medial consonants, word-final vowels, these kinds of sequences of segments. And then, using the online tools, we can do some analyses. So, we want to add new data. First of all, if your language is not already in the database, add it with its ISO code, so we don't end up with duplicates. Add any references that you're going to be needing. And you can see which references are already in there, so if you are adding yet another Germanic language, you can check whether those references are already there.
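To give a flavour of the regular expressions being applied behind the scenes, here is a toy sketch. The single-character consonant/vowel split is an assumption for illustration; the real tool's segment classes are necessarily more careful.

```python
# Toy sketch: extracting word-initial/medial/final consonant sequences
# from an IPA-like transcription, assuming a small hard-coded vowel set.
import re

VOWELS = "aeiouɛɔæɪʊə"                 # assumed toy vowel set
C = f"[^{VOWELS}\\s]"                   # "consonant" = anything not a vowel
V = f"[{VOWELS}]"

word_initial_cc = re.compile(f"^({C}+)")                 # before the first vowel
word_final_cc   = re.compile(f"({C}+)$")                 # after the last vowel
word_medial_cc  = re.compile(f"(?<={V})({C}+)(?={V})")   # flanked by vowels

def sequences(word: str, pattern: re.Pattern) -> list[str]:
    return pattern.findall(word)

print(sequences("strɪŋz", word_initial_cc))   # ['str']
print(sequences("strɪŋz", word_final_cc))     # ['ŋz']
print(sequences("æktə", word_medial_cc))      # ['kt']
```

This is exactly the kind of pattern that is easy to get subtly wrong by hand, which is the point of wrapping it in a friendlier interface.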
You upload your word list. Your word list is just a text file, so you don't need any fancy formatting. All you need is a text file with the items separated by commas, like that, and there's help here. Then you can add a transcription conversion. So I think we will add one of our own. The English data that we already have has been transcribed with syllabic consonants. So if instead we think, actually, now I don't want to look at the English data with syllabic consonants, you can instead have a new analysis (ooh, very tiny; that's completely illegible, but I can't get it to zoom in) and re-transcribe it with non-syllabic consonants. So it's as simple as this: it's just a text file with mappings from one to the next. Ooh, I need to be logged in. I don't remember my password, so I won't log in. Okay. Then you can re-transcribe your data.
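A minimal sketch of how such a mapping file might be applied, assuming tab-separated old/new pairs and longest-match-first replacement (both assumptions; the site's actual file format may differ):

```python
# Hypothetical sketch: apply a transcription-conversion mapping to a word.
def load_conversion(path: str) -> dict[str, str]:
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                old, new = line.rstrip("\n").split("\t")
                mapping[old] = new
    return mapping

def convert(word: str, mapping: dict[str, str]) -> str:
    keys = sorted(mapping, key=len, reverse=True)   # longest symbols first
    out, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(mapping[k])
                i += len(k)
                break
        else:
            # An unmapped symbol is exactly the kind of typo the
            # pre-upload checker mentioned near the end catches.
            raise ValueError(f"unmapped symbol at position {i} in {word!r}")
    return "".join(out)

# e.g. a map containing {"n̩": "ən", "l̩": "əl", ...} rewrites syllabic
# consonants as schwa + consonant, the English re-analysis described above.
```

The longest-match-first ordering matters so that a multi-character symbol such as a syllabic consonant is rewritten before its bare-consonant substring.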
You can pick your language, pick your transcription conversion and apply it. And then you calculate your different sequences, and it pops up here. And at this point we can say, okay, what are the word-initial consonants in our English data? And it will list them, which brings us onto a set of tools. Are there any questions or observations so far?

Right. So here then are some of the things you can do once you've uploaded your data to this website. You can filter the sequences, these word-initial or word-final consonants or whatever, to look for particular patterns; we'll go through that in detail. You can compare these sets of sequences to try and find out whether certain things occur, whether two different dialects, or two placements in a word, have the same properties. You can compare properties across languages. One thing I've found quite useful is using it to locate typos and loan words; this was quite useful, again, for Sylheti. And locating suitable experimental stimuli. And finally, I think quite importantly, it makes it easier to collaborate with other people.

So first of all, then, filtering sequences. Let's say we want to know what the word-medial consonants of English are. Okay: I'm interested in English, I am interested in the word-medial consonants, show me those. And this will give you the sequences of segments, how many lexical items in that list have each one, then, if you've got frequency data, the total frequency of that sequence, and then some example words. For English, if we're only interested, say, in sequences that have falling sonority, you can do that. So this set here allows you to add lots and lots of extra filters onto your query. You've said, I want English word-medial consonants, and there's a set of predefined searches that you might want to do, or you can add your own. So for example, this predefined one says: okay, I'm interested in English sequences with a certain length and a certain sonority profile. I'm afraid that for the time being, sonority in this application is predefined; the help pages do explain how. Eventually it would be nice if you could define your own sonority scale and exceptions and see what happens, but for the time being it just uses the standard one. You can specify these things and then search for the very particular kinds of sequences that you're interested in. So for example, here we are: here are all of your falling sequences in French... in English.

Okay, other things you can do. You can also ask for, for example, part-of-speech data. So I was explaining some of the different kinds of tags you could put on your data. If, for example, you've uploaded part-of-speech data, we can say: okay, in my Ambel data I'm only interested in knowing what's happening in nouns. And so here we have all of the nouns which start with consonants in the language Ambel. And then you can find examples of these. And so then, when you want to go back to your informants and say, okay, I need to get them to say a noun in this particular part of the sentence which starts with the sound l, you can just come and find the list of words. Oh dear, there's always something. So yes, as this website says all over it, it is still in beta. Part of the point of doing this is to get everybody else to use it, and then they can tell me where all the errors are, because it's only when you click on every single button in combination that you find them.
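For the sonority filter, something like the following sketch captures the idea. The five-level scale below is a conventional stand-in and an assumption: the talk only says the application's scale is predefined.

```python
# Sketch: a fixed sonority scale and a filter for falling-sonority sequences.
SONORITY = {**{c: 1 for c in "ptkbdg"},      # stops
            **{c: 2 for c in "fvszʃʒθð"},    # fricatives
            **{c: 3 for c in "mnŋ"},         # nasals
            **{c: 4 for c in "lr"},          # liquids
            **{c: 5 for c in "jw"}}          # glides

def falling_sonority(seq: str) -> bool:
    """True if sonority strictly falls across the sequence (e.g. 'lp')."""
    ranks = [SONORITY.get(c, 0) for c in seq]
    return len(ranks) >= 2 and all(a > b for a, b in zip(ranks, ranks[1:]))

medial_clusters = ["lp", "nt", "pl", "kt", "mp"]
print([s for s in medial_clusters if falling_sonority(s)])  # ['lp', 'nt', 'mp']
```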
And then, yes, the final example of this would be locating things like typographical errors. So if we look at Sylheti word-initial consonants, not interested only in nouns this time, we can sort this by the number of lexical items, by clicking here. And we find that basically all of the frequent ones are single segments and all of the infrequent ones have multiple consonants, and we can say, okay, is there something odd going on here? And when you look into it, you start going: oh yes, this is the word school. This is the word train. This is the word class. And actually this data set has been cleaned up a little bit; there were quite a few more before, and when I went back and queried whether "glass" was really a Sylheti word, I was told: oh no, no, no, it's always pronounced with an epenthetic vowel. Okay, so it's helpful to ask: why is this pattern only occurring once in my entire data set? I should go check that, because either it's a loan word, or it's a typographical error, or there's something really interesting going on here and I should investigate further.

Okay, so then the second set of tools are these set-comparison tools. These basically allow you to do a pretty similar thing to the previous page, but for two languages at once, and see which things are the same. So for example, let's choose English and say we're interested in comparing English final consonants and English initial consonants of length one. And, surprise surprise, we find the sequences that are found word-finally in English but not initially: engma /ŋ/ and the velar fricative (which I think is from the word "ugh", so your mileage may vary on whether that's a word). And then similarly you find h and r are not found word-finally in English, and so on. Obviously for English this doesn't tell us anything we didn't already know. But if it's the first time you've compiled data on your language, you can very quickly go: ah, I think I've found some kind of allophonic pattern here. And in fact, by applying this to Sylheti it became pretty obvious that several of the things that had been phonetically transcribed were in fact just allophones of each other. And the phonemic analysis that we derived from doing this matches up pretty closely with ones that have been developed elsewhere, so I'm pretty chuffed with that.

Okay, yes. So another thing that you can do, for example, is compare what happens with different parts of speech. So, going back to Sylheti again: if we look at Sylheti initial consonants and we say, okay, I am interested only in the ones where the part of speech is a noun, or a noun and an adjective, and then the ones where the part of speech is a verb, you can compare these. And we find that all of the branching onsets in Sylheti appear in nouns, which lends extra weight to the theory that they're all from recent loan words rather than an inherent part of the language. And this backs up the frequency information that we saw in the previous tab. And then, yes, finally, you can just compare different dialects. So for example, you might say: I'm interested in comparing word-initial consonants in Greek and in Spanish. We'll clear the noun and verb filters. And you can see there's quite a lot of overlap between Greek and Spanish, but then there's also quite a lot of difference, and you can pinpoint what that is. Obviously for languages closer than that, this makes dialect comparison a lot easier.
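The comparison itself boils down to set operations over the extracted sequences. A sketch with hypothetical English-like inventories (the sets below are illustrative, not the database's actual output):

```python
# Sketch: compare single consonants attested word-finally vs word-initially.
final_cs   = {"p", "t", "k", "b", "d", "g", "m", "n", "ŋ", "s", "z", "x", "l"}
initial_cs = {"p", "t", "k", "b", "d", "g", "m", "n", "s", "z", "h", "r", "l"}

print("final only:  ", final_cs - initial_cs)    # {'ŋ', 'x'}
print("initial only:", initial_cs - final_cs)    # {'h', 'r'}
print("shared:      ", final_cs & initial_cs)    # everything else
```

Gaps like these, segments confined to one position, are exactly the hints of allophony or loan strata the tool is meant to surface.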
For example, if you've got large numbers of words in two or three related dialects, you can put them in here and see what's similar. Okay. So the third tool on here is the split-sequences tool. Now, this was developed partly because of reading this paper, and the idea of this tool is to help you verify claims about syllable structure. So for example, François Dell said that there are certain restrictions on consonant clustering in French which are best accounted for by positing two different types of rhyme, where syllables with the compound type can only occur at the end of words, the compound type consisting of a simple rhyme with an onset after it. Now, they acknowledged in their paper (this was 20 years ago) that their data had been cobbled together from different sources: this one listed word onsets in French, this one listed codas in French, and they just put all of these different sources together. But obviously times have moved on, and it's now much easier to verify these kinds of things. So what we can do is take sequences of interest, for example word-final consonant sequences, and sub-sequences of interest, here, according to this theory, the word-initial consonant sequences of French. And then we can say: okay, what I'd like to do is locate these sub-sequences on the right-hand side and find out where the matches are. And so we see that for "vre", the longest match is the onset "re"; for "re", it matches entirely, and there's nothing on the left-hand side; and for "rdre", as in "perdre", we find that in fact there are multiple consonants that you can match. At which point you get something that looks like this. So this is a list of all of the word-final sequences in French; on the right-hand side are the parts of those sequences that you can find at the beginning of French words, and on the left-hand side is whatever's left over. Okay? And actually, if you look down this whole list, which is slightly condensed here, here is the list of things that are left over, and they're all single consonants, all simple rhymes, with these two exceptions, which are each one lexical item: this one is the English word "miles", and this one is the English word "underground". So actually, François Dell is entirely correct in terms of the data that we've got available in the 25,000 most common words of French. And so this allows you to just check these kinds of theories. And then you can compare the results of those. So, actually, are all of the final consonants of French found in those simple rhymes? Are they all the same? And if we compare them, we can see that actually all of the single word-final consonants of French are found in that environment, the only exceptions being, as I said, the two English ones.

Okay, so those are the three different kinds of analysis tools, mostly for working within a language, although as I said, you can use them to compare different dialects. We've also got three different ways of comparing between different languages and dialects. So we'll start with this one. This is the Compare Doculects tool. Basically, it has some preset sequence types; these are the preset filters from the other page: branching onsets, rising word-internal sequences, vowels that occur more than five times. And what you can do is just say: I'm interested in comparing what this looks like in multiple different languages. And then you can get a quick overview of what these languages look like, whether they look similar or not.
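Going back to the split-sequences tool for a moment: the matching it performs can be sketched as a longest-suffix search against the attested onsets. The inventories below are toy stand-ins; in the tool they come from the uploaded lexicons.

```python
# Sketch: split a word-final cluster into (residue, attested-onset suffix).
def split_cluster(cluster: str, onsets: set[str]) -> tuple[str, str]:
    for cut in range(len(cluster)):        # try the longest suffix first
        if cluster[cut:] in onsets:
            return cluster[:cut], cluster[cut:]
    return cluster, ""                     # no suffix is an attested onset

toy_french_onsets = {"vʁ", "dʁ", "tʁ", "ʁ", "t", "d", "v"}   # assumption
for final in ["vʁ", "ʁdʁ", "ʁt"]:
    print(final, "->", split_cluster(final, toy_french_onsets))
# vʁ  -> ('', 'vʁ')    the whole cluster is a possible onset
# ʁdʁ -> ('ʁ', 'dʁ')   simple rhyme ʁ plus onset dʁ, as in 'perdre'
# ʁt  -> ('ʁ', 't')    simple rhyme ʁ plus onset t
```

If the residues in the left-hand column are all empty or single consonants, the data are consistent with the Dell-style claim that word-final clusters are a simple rhyme plus an onset.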
So for example, we find that all three Indo-European languages have these triple branching onsets that start with an s, whereas in Cheke Holo, an Oceanic language, we've only really got these tʃ-type things. And to be honest, once you start looking at the Cheke Holo data, you go: those are affricates; that doesn't have three segments at the beginning at all. And so on. And so you can start comparing. This is just a quick overview that allows you to get a quick feel for what your language looks like, and what it looks like in comparison to other similar languages. Okay. And then again, you can just add filters to this. So you can say: actually, just take out anything that only occurs in two words, because I don't care if it only occurs in two words; I just want the major patterns of this language. Or indeed, you can say: I'm only interested in things that are super rare, so only give me patterns that occur in fewer than 10 items in my entire database.

Okay, second thing. This is close to what I've been working on. This allows you to compare the values of the parameters that describe a language. Now, these are not perfect calculations of those, but I think it's a very helpful place to start. So, again, let's look at, say, French and English, and we'll look at all items and places where there are at least 20 words that have that feature. And we can sort these. Okay, so let's compare English and French. Where do they differ? Well, English has far more vowel heights than French does. So this parameter says: is there some kind of tense/lax contrast or ATR contrast? I don't know if that's visible at all; if you hover over it, it will give you the justification for why it's decided that. So it says for English: English has the following front vowels and the following back vowels, therefore it probably has a tense/lax contrast. English has a glottal fricative where French doesn't. In this transcription of English, we've got syllabic liquids and syllabic nasals where French does not, and so on and so forth. We have this velar fricative for some reason. But this, I think, is a good place to get a quick insight into your data, particularly if you're not a phonologist and you go: okay, I've done my field work, I have all of this data. Should I show any of it to a phonologist? Is any of it interesting? Well, you can go on here and say: oh, this parameter has a different value than every other language in the database; maybe I should ask somebody about that. And hopefully it will just give you a quick insight into what's going on. Like I said, the way it's calculated is fairly rudimentary, so this isn't a good place to go for actual answers about how you should describe your data. But it's a good place to start getting some ideas, I think, and to start looking into things and going: well, why does English appear to have a velar fricative? What's in my data that has given English a velar fricative?

Okay, and then the final comparison tool is a word-comparison tool. So, for example, you can say: give me all of the words in both Matbat and Ambel whose gloss contains the word fish. And that way you can quickly ask whether there are words that are related in these two dialects, and basically start doing a bit of rudimentary comparative linguistics quite quickly. Alternatively, you can say: I'm interested in whether the orthography shares something like this sequence of letters, or the transcription has this sequence in it somewhere.
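On the parameters tool for a moment: the tense/lax calculation is, as comes up in the discussion later, literally a count of front and back vowel qualities. A sketch of such a heuristic, with an assumed threshold of four qualities per axis:

```python
# Sketch of a crude tense/lax (ATR) parameter guess: many distinct front or
# back vowel qualities suggests a tense/lax-style contrast. The threshold
# and vowel classes here are assumptions for illustration.
FRONT = set("iɪeɛæy")
BACK  = set("uʊoɔɑ")

def suggests_tense_lax(vowel_inventory: set[str], threshold: int = 4) -> bool:
    front = len(vowel_inventory & FRONT)
    back  = len(vowel_inventory & BACK)
    return front >= threshold or back >= threshold

print(suggests_tense_lax(set("iɪeɛæɑɒɔʊuʌə")))  # English-like: True
print(suggests_tense_lax(set("ieɛaɔou")))        # French-like: False
```

Note that the French-like inventory comes out False even though French does contrast tense and lax mid vowels, which is precisely the weakness raised in the question period.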
Okay, I think, since all the data's in there, it's just quite a helpful tool to have. Okay, and then, yes, finally: this wasn't actually one of the design purposes, but since I've started explaining it around UCL, I've had a couple of people come and go, oh, that's really useful, I'll use that for my experiment. Which is that you can use this to say: I'm doing an experiment on onsets in English. I need to find the set of onsets in English that have standard rising sonority, none of this weird s stuff going on. They need to be exactly two segments long, and they need to have similar frequencies, so that I'm controlling between my test items and my control items. Well, you can just do this. You can say: okay, I've got English; they're this long; they've got increasing sonority; and here are your examples. You can say, okay, I want just the most frequent words, and there you go: there is a list on the right-hand side of all of the words in this database that have those segments. And at this point, it's very quick and simple to just go in and locate the relevant experimental stimuli.

Okay. So then, finally, access. Quite often, people are a little bit reluctant to release their data into the wild and have it vanish off outside of their control, where people could do anything with it and where the attribution may just get completely lost. So, like I said, every lexicon and analysis is always linked to at least one reference. You're not sending this off into anonymity; if anyone strips out the citations, then they're deliberately stripping out citations, and you can't do anything about that anyway. And you can upload your data privately. So here we go. First of all, references: you can find which different word lists are relying on the same source, and, vice versa, you can find out which sources a word list is relying on. I'm afraid that this computer doesn't have my login, and therefore demonstrating this has gone a little wrong; I can demonstrate it later. Basically, in order to upload a word list, you have to create an account and log in. Once you've done this, you don't have to share that information with anybody. The only person who can see your lexicon once you've uploaded it is me, if I go onto the server and start messing around in the database and say, oh, what's happened today? Nobody else is going to see your data unless you choose to share it, at which point there is one-click sharing: you can say, I want to make this open access, and then anyone who comes to the website can see those languages. So, as you can see, this is a set of word lists that are all open access; anybody who just comes to the public website can see them. There are another half a dozen word lists that I've uploaded for my own use, but I don't have permission from the original authors to share them with the wider world. Some of them because the authors are still working on them and don't want to put their data out there before they finish cleaning it up; some of them just because, for copyright reasons, they're happy to share them but not openly. But you can share them with other people who have accounts. So you can use this website to collaborate with as many people as you like on your dataset without actually sharing it more broadly than that.
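The stimulus search described a moment ago can be sketched by combining the earlier pieces. This reuses the toy SONORITY scale, the word_initial_cc pattern and the LexicalEntry records from the sketches above, so it is not self-contained; the frequency band is an arbitrary example.

```python
# Sketch: find words whose onset is exactly two segments with rising
# sonority, inside a target frequency band (all thresholds are assumptions).
def rising_sonority(seq: str) -> bool:
    ranks = [SONORITY.get(c, 0) for c in seq]
    return len(ranks) >= 2 and all(a < b for a, b in zip(ranks, ranks[1:]))

def candidate_stimuli(entries, fmin=10.0, fmax=100.0):
    for e in entries:                       # e is a LexicalEntry from earlier
        onsets = word_initial_cc.findall(e.transcription)
        if (onsets and len(onsets[0]) == 2
                and rising_sonority(onsets[0])
                and e.frequency is not None
                and fmin <= e.frequency <= fmax):
            yield e
```

Onsets like "pl" or "tr" pass the rising-sonority test; s + stop clusters (the "weird s stuff") fail it, which is the behaviour the experimenter wanted.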
Finally, I know that some people have restrictions on their data that mean that they can't share it even with me, because once it goes onto my server that's gone beyond your data protection, and you may have reasons for promising your consultants that their data is going to be kept very secret for a certain amount of time, or what have you. At which point you can install your own copy of the database and all of this software on your laptop. You can take it with you; you don't need an internet connection; it's all just there on your machine, and you're the only person who has access to it. At that point you don't have access to any of these other data sets that have been uploaded, because obviously that's how the internet works. But you can keep working whilst you're out in the field. You can still use all of these tools; you just won't be able to use the collaboration side of having it uploaded onto the web.

Okay. So this basically brings us to the asking-for-data part of the talk. If you are working on a lexicon for a language, if you've been doing field work, if you've been collaborating with somebody else, if you came across one whilst you were reading an interesting paper the other day, please do send it my way, upload it to the website, and let's get as many lexicons into this thing as possible. Currently it's rather Indo-European-centred, because those are the big projects that have already been done and that are already publicly available on the web. There are lots of very decent lexicons available for all kinds of non-Indo-European languages, but they tend to be lurking somewhere on the servers of a university and are not as publicly accessible. So, again, if you have this kind of data, please do upload it. Similarly, if you have your own personal theory about one of these languages, if you very strongly object to the fact that Polish has been transcribed with nasal vowels or that English has been transcribed with syllabic consonants or any of these things, then please do upload your own analysis. Start messing around with it, see what happens. You can upload a new transcription conversion and compare between the two versions. So different "dialects" could just be the same underlying word list, but in this one we've got English with syllabic nasals and in this one we have English without, and how does that change the structure of the language? And then, yes, absolutely, use this to discover things and then publicise them. Okay, so thank you very much for listening, and I really hope to get lots of feedback from everybody on what works and what doesn't. So please do tell me. We also have at least one laptop here and here, so if people want to just play with it themselves and have a look, that's absolutely fine.

Maybe you're going to say that it doesn't matter; this is just from listening to your presentation, so I thought I'd ask. At one point you were going very, very quickly, so it was difficult to look at the data. No, no, it was not about looking at the data, so it's not your fault. But at one point I just saw a compound in English. We were looking at word-internal clusters, and I saw a compound. So then I thought: does it matter to you that it is a compound, or does it not matter? Because I suppose that it would matter a lot to me, but maybe you would say no. Would you be happy to see a compound discarded, or do you keep it there and do the homework?
So the idea would be, and this is part of the point of this other-tags field, that you can say things like: okay, I'm going to tag each of my words; this is a compound, this is a single morpheme. So this has to be the user doing it. Do I have an example on here? So if we look at this Matbat one, you can put in your custom tags; you could put "compound" in. But basically this is the standard garbage in, garbage out: if the person who compiled the lexicon has not included that information, we don't have that information. So in this case, the origins of the English word list are the CELEX database, and I don't think they included it, certainly not in the query that I got out of their database. So we just don't have that information. But yeah, I think it is very important. If you're interested in doing that, you can, for example, transcribe your data with marks between compounds, a dash or a full stop, whatever you want to use, and then you can have one transcription conversion that strips those out. If the Greek person has not done that and I want to compare compounds in Greek and French, just for the sake of it, then... if the data's not there, it's not there, and there's not much you can do about it. If people have suggestions for ways of automatically doing that... but I can't think of any, so I haven't done it.

Because my concern is that if it's left, and I take it that you're not a policeman, but if it's left to everyone, each individual, to come up with their own morphological analysis, syllabic analysis, orthography, this and that, transcribing the tones, I'm just wondering whether at some point it may be tremendously difficult to do comparisons.

Yeah, I mean, this is definitely one of the problems with it: unless people are using the same criteria, it's difficult to compare. So this is a step back from comparing directly with parameters to comparing on people's segmental transcriptions. Now, obviously, what you really want is everybody's audio recordings, but that's difficult to automate and very difficult to cleanly put into a single database. So this is a kind of compromise between the two. And also, the hope is that people will put detailed phonetic transcriptions in as their original word lists, and then their transcription conversion will take that detailed phonetic transcription and transform it into a phonemic analysis of some kind. All of the ones in here are done that way.

Some things are not important. For example, I'm looking for branching onsets, right, and I go to whatever language, and then I see words beginning with dm. So then I'm going to say: well, okay, that person thinks that it's a branching onset; I happen to disagree with that, so I'm going to restrict my parameters and say I want a liquid in C2, let's say. But at one point, when you were comparing English and French, we saw that the ATR parameter was false for French. So I understood: no tense/lax contrast in French, if my understanding is right that that's what it means. If I didn't know French, I would not question it, but it's absolutely wrong. It's wrong. So then I'm looking at a language, say CVT, which I don't know at all, and I see the parameter is off. So I don't go there, because you're telling me the parameter is off, but it's not true.
The parameter is on, and the person just didn't see it. So then, what's the point?

So, like I said, the point of the parameters bit is mostly as a tool to semi-automate the process, to make it easier for me to do it by hand. So in my own work I'm going through, locating these parameter values and deciding on them for every language. If I go to French and look at all the lax vowels and all the tense vowels, it's going to take a lifetime. Yeah, and that's the reason we've got these tools: to try and speed it up.

But it's just that, I mean, I don't know...

So the way that tense/lax parameter is calculated is: how many front vowels are there, and how many back vowels are there? It's literally that simple.

But that has nothing to do with an ATR parameter.

Well, English has got five front vowels, so there's something going on there.

Yes, but then you just take the mid vowels of French and you're fine.

Well, in English you take them all. But I mean, it's not a black-and-white answer like that. So, like I said, the idea of the parameters bit is to try and speed up the process. It is the bit that, to be honest, is not really finished yet.

It's just that it forces us, I think, to take what may be a wrong path. You see, someone would discard French because of this ATR parameter. They would just see that French doesn't have that contrast: I saw it in that program, so French is out of the question. Which they wouldn't, if the answer had been true instead of false.

Like I said, the point of that bit is that you can go on it and say: here are the items that caused the algorithm to make this decision, and you can go look at them yourself. The idea is not to give you an answer; the idea is to give you a suggestion. And mostly to say: I already know the answer, so why has this come out differently? What's in this data that is suggesting otherwise? And, like I said, to ask: why have I got a velar fricative in English? That doesn't seem right. And to kind of point these things out.

Can I just go back to the compound question? For somebody who was wanting to do this by hand, a useful thing would be to take this data and reverse-engineer it, working backwards from the lowest-frequency complex sequences, since those would be the first places you would look for compound boundaries, right? Just a way of working through the analysis with semi-automated tagging. So if you find, you know, the dip, and it turns out to be pretty low frequency, you pull out all those words and you'd probably find most of your compounds.

Yeah. I mean, if you combine that with things like: is this syllable contact rising or falling, and does it have a different place of articulation? You can add in lots of criteria to make it a very sort of one-click yes or no for the items it suggests. Currently, you can only put tags on when you initially upload the data; there's no way of adding extra ones later, because the idea is that the original author has control over that.

It's a good project for a PhD.

Yeah. It takes a lot of tactics at the same time as using the database. Maybe you could say a little bit more about how you've used it with Sylheti, because we've kind of played with the tool as you were developing it. But I think it's very interesting, because as a tool to help discovery, it's proven itself: you used it.
So I'm not sure how, but as you point out, for the sort of cross-linguistic comparison it might be a little bit in its infancy. But starting from basic transcription of raw data, which is pretty much what we're doing, it's proven useful. But could you say a little bit more about syllable structure as well? Because I'm not sure that I understand how it works that out.

So no, it very definitely does not come to a conclusion on syllable structure, but it does have some tools to help you do that. So I'll just start with demonstrating a bit of the Sylheti... which one was which? I think it was this one.

Just as context: we studied Sylheti in one of our field methods classes, which Elizabeth was part of, and she tested her tool with the data.

Was it the other set? Maybe the other set. So yeah: by comparing which segments we found in different places, it becomes obvious which ones recur and which ones don't. So for example, let's do this splitting thing again. We're interested in splitting Sylheti (this was the original one), we're interested in splitting Sylheti word-medial consonants. Is that the one? Okay. So I have two versions of Sylheti on here, which I have apparently not labelled very well. One of them is the original version that comes directly from the transcriptions, with no analysis, and the second one has the transcription conversion that says: actually, we know that these are just allophones of the same thing, so we can set them up as their underlying structure. So if we compare those... there we go. We start to find that, for example, you get these sequences with g in Sylheti, and these only occur word-medially. You don't get them word-initially or word-finally or anywhere else; you only get them in these very particular environments, and you start saying: okay, so I think that this g is not a phoneme of Sylheti. It looks like it's allophonic in certain environments, and you start investigating some of these. Particularly, as I think I've mentioned, all of these in-theory branching onsets of Sylheti don't show up word-medially at all. So you don't get a coda followed by a branching onset in Sylheti at all. You don't get them in verbs. You don't get them in high-frequency items. And once you actually start going through them individually, one at a time, you start seeing that every single item that I've tracked down an origin for has either been a Sanskrit or an English loan word within the last 100 years. And so this is where you go and say: this looks a little bit odd, what are these sequences? And so you can choose Sylheti word-initial consonants, and then we've got all of the example words here. And when you click on them you can see the orthography and the gloss, which allows you to double-check these things. So for example... where's it gone? Here, all the ones beginning with tʃ: we have "cherry" and "chocolate". And at a certain point, once you've got the orthography, once you've got the pronunciation "cherry" and the gloss "cherry", you go: okay, we have found a loan word. It's a little bit slower when it's a Sanskrit loan, but this is part of the other tags. So I don't know if I've got them in the Sylheti tags, because I started tagging the Sanskrit items, and then you can search and say, I want to exclude all of the Sanskrit items from my search, and you start losing these. Yeah, the Sanskrit ones from the grammar.
But yeah, so you can compare, for example, Sylheti as it was originally transcribed with Sylheti in the phonemic version. And we find that the transcriptions have sometimes marked stops as dental, sometimes alveolar and sometimes retroflex, and there are actually only two categories. I'm not entirely sure which those two are, but there are definitely not three. And so you can compare the impact of doing your phonemic analysis and see which things vanish. So yeah, again, the phonemic one made it alveolar instead of dental. Let's see... the other question. Oh yes.

Can you go back to the calculation you did a few slides back, comparing the medial consonants in Sylheti with the final consonants? Because it actually is very revealing about the syllable structure.

Okay, let's go with... I'll omit the stuff that's very low frequency. Let's see, medial consonants, and then we want... which one was that?

Yeah, word-final and word-medial.

We're interested in the two different kinds of onset, or the two different kinds of coda? Let's just say however many you can have in those positions: one, two or more consonants. Okay, we'll just look at onsets for a second then. So, comparing initial with final on the left. And then let's say where there are at least five lexical items. I may have been using this software quite a lot over the last couple of years. Okay, there it goes. The problem you get with this is that once you split them... I just thought... so this one shouldn't be "more than five", it should be "at least five".

What I just wanted before... it's a pretty simple thing: just list all the consonants you can get medially, and then list all the consonants you can get finally.

Okay, let's... sorry, let's put this one back to blank.

Without pre-judging the syllable structure.

You can, but I didn't bookmark it earlier. So these are all of the medial sequences, these are the initial sequences, both initial and medial sequences, and these are the ones that you just find initially.

Yeah, we need the finals. We want to compare the medials with the finals.

Okay. But it's very cute. It's nice. I mean, look, it's practical. I used this to test my hypotheses so far. So, I mean, this is the origin of my work on the Sylheti stuff. And I know that Faith Tuen was using it to locate English items for her experiments, and Catherine was using it for her Sylheti work, because they were originally getting their consonant sequences for Sylheti from Bengali data and using that as their source, because that's the only data they had.

Yeah, I know. It just looked like... I want word finals on their own, and to compare them with word medials on their own.

Well, this is what you had: just list all the word finals, list all the word medials, not using any of the filters. So, this is final and word-medial. I can go back to the original transcription if that's better. Medial, final. But this is without doing the split thing that takes the onsets off the medial sequences. If you want to do that...

The maximum-length parameter, like a one, right?

Yeah, this is... yeah.

Are you just showing a number in this? Like, you had the results and then you narrowed them down using one. It was very nice, for example. I was like, whoa.

Yeah, this was a very clear mismatch. It sort of tells you a lot about what the syllable structure looks like.
I'm just going to keep pressing back, and you can tell me when this page looks like what you want. Okay, can you do that? There we go. It goes to this page. Was it this way? It was at the top.

It'd be nice to be able to do that again, yeah. I'm afraid I'm not really sure which one it was that people are interested in.

If you keep going back, you will find it. This one? It would be way back there, because it was about halfway through.

Yeah, yeah. Okay, if it's in the talk, then I've got a link on the slide somewhere. If it was in this question bit, then I have no idea.

Okay. But it said a lot about the syllable structure.

Yeah, I mean, it looks like word-final clusters... you don't really get them, right? Except for what I've said.

No, I don't think so. I mean, I'm speculating with the next one. But it sort of looked like you've got plenty of two-consonant clusters word-internally that you don't get word-finally.

No, I think the few examples we have are loan words, if I'm remembering rightly. So it looks like it's basically a single-final-consonant language. Which is very odd for some of these things, but that's the list of codas in the language, because the list of internal codas is tiny compared to that. We'll have a poke at it later and see if we can find it.

Yeah, this is really nice. And I think back to your original statement, that it just makes regular expressions user-friendly. So I've been trying to do this using regular expressions, and it's still heavy going, because you have to learn the right expressions, and by the time you learn those you still have to actually go back into the data. The problem I'm finding now, and I don't know whether or not this would solve it, is that I often am looking for sort of the input to the process, right? So this is obviously just showing... I mean, for me, I guess maybe it speaks to this problem with the French ATR: there's something that masks the phoneme because there's a process that happens, right? So, for example, I'm looking at certain types of codas, but there's a process that deletes codas. So finding the underlying input is sort of inherently impossible, as far as I've found. Now, does that question make sense?

Yeah. So, I mean, this was part of this set comparison with Sylheti, because there's an assimilation process in Sylheti. And then you start finding that certain patterns are just missing, and then you go: ah, there's some kind of assimilation process going on. And then that allows you to go back and look at your data and say: okay, I found this process by comparing, you know, word-medial sequences of this length and word-final sequences. Here are my word-final segments; half of them have just gone word-medially. So why is there a mismatch? And so, for instance... let's do it with Welsh. You can compare word-medial and word-final consonants of length one. And then you can ask: are these the same, or are there ones that are actually just missing in certain environments in Welsh? Is there some kind of assimilation process that might be going on, or some kind of extra process? So clearly there's something going on with the voiceless ones in Welsh, because they don't pop up word-medially, or at least they don't according to the people who transcribed this data, which is, you know, the limits of what we have.
Well, one of our former students is working on a new corpus of Welsh, which might be interesting too, if you really want more Welsh later. Okay. Maybe one last question. Oh, maybe not.

Time-wise, how long does it take for somebody to upload their data?

So, you want a lexicon... okay. Oh, thank you, yes, that's an important, relevant point. So one of the problems you get is that if you have errors in your data, for example somewhere you've written down c instead of k, then when you go to convert it into phonemes, you haven't accounted for the c, because it shouldn't be anywhere, and then the whole thing errors, and you're back where you started. So I have some extra tools: if you happen to have a Windows machine, you can download them and basically run the checks. You basically do the process that this website does, just on your own machine. And this is half of the typo-spotting, actually. It happens at this earlier stage, when you try and convert your transcription to the IPA and you discover that there are lots of symbols in your transcription that you didn't know were there. And so these tools basically allow you to do that.

So it depends what format your data is in to start with. Basically the process goes: you need to get your data into the format of a text file, with each entry on a line, separated with commas or tabs or whatever it is that you want to separate it with. That's step one. If people have given you data in a PDF, it's going to be a pain; but as long as you have data that you can edit, it should be relatively quick. So, for example, tools like FLEx make it very, very simple to export just a text file and turn that into the right thing. Then there's uploading it and running this. Running the checker takes you possibly an afternoon or two of going: ah, there are errors in my data, I need to locate these errors, and now I need to check whether it's a typo or whether they meant to write that. And this process of going back and forth and finding errors in your data, obviously, how long is a piece of string? It can take somewhere between ten minutes and two weeks. But it's useful to do, because you're locating errors in your data. And then uploading it and running these regular expressions over it takes somewhere between 30 seconds and five minutes, depending on how much data you have. So all of these big data sets process a lot more slowly, as they've got a lot of data in them. But the standard sort of working lexicon, if you're doing field work, is probably no more than a few thousand words, at which point it's fairly instantaneous once you've done all of the prep work of cleaning it up. And if you're working with, you know, a compiled dictionary of the English language over the last 200 years, then uploading 200,000 words or something is going to take a little bit longer. But still, you know, go get a cup of tea and then we're done.

Would you do it by hand? Would you have the ability to do it by hand?

Well, you have to come here and click the button and upload the file. No, sorry, you have to feed it, you have to... yeah. But I assume that at this point most people are keeping their lexicon in some kind of computerised format. So if you've already got it in something like FLEx or any of these database tools that people use, or a text file, a spreadsheet, any of these things, it's quite quick to turn it into the format required here.
It's not very complicated.
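The pre-upload check described above, finding symbols that the conversion table never mentions before the conversion fails on them, can be sketched like this, assuming one character per symbol:

```python
# Sketch: list symbols in a word list that no conversion mapping covers,
# together with the words containing them, so typos can be chased down.
def unmapped_symbols(words: list[str], mapping: dict[str, str]) -> dict[str, list[str]]:
    covered = set("".join(mapping))          # every character any mapping mentions
    problems: dict[str, list[str]] = {}
    for w in words:
        for ch in set(w) - covered:
            problems.setdefault(ch, []).append(w)
    return problems

toy_map = {"k": "k", "a": "a", "t": "t"}
print(unmapped_symbols(["kat", "cat"], toy_map))   # {'c': ['cat']}
# One stray 'c': exactly the "c written for k" case described above.
```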