So, yeah, thanks for hosting this. For the past couple of years I have been working with my colleagues at the University of North Texas: computational linguists and information science professionals, professors in those two departments, as well as our own computational linguists. Some of the things we've been looking at are my own FLEx projects for Meitei and for Lamkang. We've been trying to think about how to bring them up to a level where, as we add more languages, we would be able to compare them faster, facilitate better descriptions of those languages themselves, and also add newer languages where a lot hasn't been described yet, but where with some connected texts we could speed up annotation and know where the differences are. In thinking about that, we started thinking about the importance of standardizing, not necessarily our annotation, but at least having a thoughtful process in how our interlinear gloss texts are created. This has been done many times before, but it's been done from a much higher level: for all of linguistics, we would like to create standardized annotation schemes. NSF has funded things of that nature. And they've always sort of crashed, not totally, but they haven't taken hold, because communities of users have their own ways of doing things, and within those communities individuals have their own ways of doing things. So I thought it would be really useful to start from a smaller group of people. Why not start with us? If we have only 50 languages and only a handful of researchers, why not look at our IGT and see what we can do to standardize some of those processes? So I had some of my students work with me, specifically Mary Burke, who's a PhD student at UNT in the Information Science program with a concentration in linguistics.
And we did a sort of survey of several of the already published grammars and descriptions of South Central languages. We tried to see what the similarities and differences were for some of the major grammatical constructions that we see repeated over and over. David just covered several of them already. We know that IGT is useful both in the analysis stage and in the presentation stage. In the analysis stage, we use it right from the get-go: we record things, and now, with software like FLEx and ELAN, we're able to use programs to do quick transcription first. I use SayMore more and more these days because I can train speakers to use it with me, and then we get time-aligned transcriptions that we can throw into either ELAN or FLEx. I mention this because these software packages often constrain or guide the way we annotate things. As we're parsing, the programs help us keep track of what we're doing, look back at what we've done before, and improve our parses. But they also restrict how much we can move things around. We have to make certain decisions based on dashes and equals signs, which are the basic units of division that they allow us. So we can do prefixes and suffixes, we can do enclitics, and there's also a way of noting circumfixes: we can use dashes, dot dot dot, you know, and parentheses to do that. Okay, so each of these things seems to feed into the other. We have free translations that help us with the word glosses, and so on. So the interlinear gloss text, even at the very beginning, helps us with analysis, and we're making decisions as we go along. And it's really crucial to remember that these software packages don't help us save, I would say, 50% of the decision making that we're doing. Like, what are the constituents? Where does the noun phrase begin and end?
Which is the auxiliary? The software doesn't help us save a lot of that because of the way it's made. So we'll talk about that a little bit. At the end, we might end up with something like this, with these different lines of glossing. Some people have more, some people have fewer, but what this looks like will be familiar to you. Each of these pieces holds a lot of grammatical information. And as we're working in Kuki-Chin, we'd like to see similar sorts of reflections of the analyses that come out of these decisions that we make. I know you've probably already looked at FLEx before, but I wanted to point out that, given the way we spell things and the way we gloss things, we can use the concordance program to really help us with the analysis. We can use the lexicon to go back and check on what we're doing and make sure we have standardized spelling. So anyway, that's the whole analysis stage. IGT is great for that. We've also talked, I think, in the field about how using both experimental data and connected text data is important. So there are all these reasons why we want to encourage people to be collecting interlinear gloss text. Then in the presentation stage, we're right now looking at how having good IGT can help with creating pedagogical materials. Let me just show you an example of that. I've got this very short story, and you all must have something like this in your collections. It's very repetitive: you have the third person past repeated in very short constructions, like "the cockroach was eaten up by a chicken, the chicken was eaten up by an eagle, the eagle rested on a tree." These are repetitive kinds of simple sentences that can be pulled out and put into children's books or into teaching materials easily. You can also use the interlinear glosses for more detailed grammatical inspection.
So I was just talking to Linda about this yesterday: how can we speed up the creation of pedagogical materials? If we can have a more standardized way of doing our IGTs, then perhaps we can build templates for going to these collections and saying, here are some lesson plans based on your interlinear gloss corpus. If we have the same kinds of lines of analysis, we could then maybe use that for pedagogical material creation as well. I was told half an hour, so I've got only about half an hour's worth of material here. We're going to look at how we have made these kinds of decisions in the things that have already been published: whether an item is an affix, stem, root, or clitic, and is it clear? How have we made those decisions? What are the implications of the ways that we've put compounds down in our IGT, reduplication, and so on? So let's first take a look at something we just talked about, semantic change or grammaticalization. In Lamkang, for example, as you see on the left-hand side, you can have these things like yung and hung and hung as main verbs. You can also have them, as you see on the right-hand side, as prefixes. I am indicating this with the conventional distinction of either writing something in lower case, saying this is a lexeme, or writing it in all caps, saying this is actually a grammaticalized piece of the grammar. This immediately sets it apart from its role as a lexeme and reduces the amount of freedom it has in constructions, in what kinds of NPs it can take in grammatical structure, and so on. It's really just a bound marker. We see this as something that should be standard, but actually it is not, and it's quite different in different sources.
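The lower-case versus all-caps convention described here lends itself to a mechanical check. The following is a minimal sketch in Python and is not part of the talk: the function name `classify_gloss` and the example glosses are hypothetical, and it assumes that a gloss written entirely in capitals is a grammatical label while one written entirely in lower case is a lexeme.

```python
def classify_gloss(gloss: str) -> str:
    """Classify one gloss by its letter case.

    Assumed convention (hypothetical, for illustration only):
      all capitals -> grammatical label, e.g. "DIR" or "3SG"
      all lower    -> lexical gloss, e.g. "come"
      anything else is reported as "mixed" so a person can review it.
    """
    # Digits and punctuation such as "." or ":" carry no case, so ignore them.
    letters = [ch for ch in gloss if ch.isalpha()]
    if not letters:
        return "unknown"
    if all(ch.isupper() for ch in letters):
        return "grammatical"
    if all(ch.islower() for ch in letters):
        return "lexical"
    return "mixed"


if __name__ == "__main__":
    for g in ["come", "DOWN", "3SG", "adverb.DIR"]:
        print(g, "->", classify_gloss(g))
```

A check like this would only flag glosses for discussion. Whether a form really is grammaticalized remains an analytical decision.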
So here is an example from Lai. This is not really fair to Ken Van Bik because it was an unpublished handout that he used at Dartmouth when we came in to do a grammar overview there. What his presentation suggests is that "come" and "down" here are both on an equal footing: one is not a prefix, but maybe it's supposed to be a compound, or maybe it's a sequence of serial verbs, or there's some grammar that's reflected in the way that this is glossed. So in our code book of how to do IGT, directional verbs, in the way that we're going to talk about them, should be standardized, or could be easily standardized, I think, with just a little bit of discussion of where we'd want to use our caps and where we might use the lower-case forms. Let's see, how about this one? Same here, right? Or here we do have adverb.directional. Did I make a...? The gloss for decaf is missing. That's what it is. Okay, so "go down". So the adverb here is the coolant part and the goal is... The goal should be right here. The goal is the color part. So if you... Decaf is a separate... So we have a... Separated form. Decaf is the calendar. No, that's the same. Well, here's the calendar. Oh, okay. Color is the goal. Okay, let's look a little bit at reduplication now, where there is a lot of variation in the way that we represent these things. I just put up the Lamkang verbal complex because I know that many of us have seen something similar. For Lamkang, you can find, right here in zone six, reduplicated forms that are acting as adverbial intensifiers, but the verb root itself can also be reduplicated. So let's take a look at how that looks. This one is put put put put, which means kissing someone. So for repeatedly kissing, you would say put put put put. But it has the participant markers here. Oh, this is actually just the cheek that's being kissed, right?
So put put put put, and then you have the inflection coming right at the end of it. The way that we've chosen to show that this is different is by using this em dash, but in most of the cases that we've seen, we have to do this manually after we pull it out of the program. We can't use this in FLEx, so we can only use a regular dash, which immediately is not giving us what we want. This is really one form. So I know that Zachariah, for example, you chose, and I think you as well, to just write things of this sort together as one form. So that's one thing that we could also do. Look at this reduplicated form in Lamkang. You get advante, advante in. So you've got a very large chunk of it that's zone one, zone two. And, oh, this is actually the main verb, the van, and then, oh, sorry, t is the main verb. So you've got zone two, you've got the verb root, all of that copied, and then zone seven at the end. We want to show that this is not a compound but a reduplicated form, and we need to figure out how to do that. So that's that. And here's the one where I was showing how zone six is copied. You've got chen, and then dok dok, dao, and seksekra. So there, it's right in the middle. And again, we're trying to use that em dash to show how that happens. Okay, here's how it's done in Mizo. This is unfair in the sense that these are really legacy materials. These were done in the 1980s, so we don't expect them to have the same sort of sophistication in the glossing paradigms, because they were done with a typewriter or whatever. There are going to be differences there, but it's very interesting to see all the choices that could possibly happen. So, "very, very": I would like to point out a couple of additional things on this one. One option is to gloss exactly according to meaning. Another is to gloss in a more abstract way, as in "intensity, intensity," as you see on the right-hand side.
And then here, of course, is a completely different type of form, very much like the put put put put: a sura sura. In this case they are just written separately. That's the Mizo. In Daai Chin, we find yet another way of doing it, where it says "intensifier" and then "very," which gives us a meaning for what the first part is and what the second part means. There is a dash on the baseline, but in the gloss line you've got a colon giving you the meaning for the whole thing. So this is one possibility, where we could say what the super category is and then what the meaning might be. But we don't have that standardized yet. We're still playing with that. In Khumi, we've got what you had called the verb classifier. The whole thing is written together as one, and then the whole thing is just called a verb classifier. The meaning of it is not given as we had in Daai Chin. So that would be something that you'd want to get, I guess, from the free translation. It's not sort of, okay. The meaning there is big. What? The meaning there is big. The meaning is big. Right, right. But I would probably find many things glossed as verb classifier, augmentative verb classifier, and they wouldn't just all be foo-foo. There may be others that are like that too, right? And they would all mean big. They would all mean basically. So in a way you could have known as an alibi. Or Laura, right? I mean in a funny way, but kind of. Yeah, I mean, AUG VCL is an open class. I mean, it's hundreds of elements. And they would all have pretty much the same meaning, but phonologically... The ones that are AUG VCL all have the central sense of largeness of a participant. They might have some additional lexical meaning. So that would be a perfectly reasonable choice, to say that this is what we want to do.
And if it is possible for us to see more general use of it in the other languages of the family, it would be useful to do that. In Hyow, we similarly see things written as one form. In the "glittering leaf" part, for example, you have yao, yao, and then you have the ah, which may be a separate morpheme there, but. Yeah. And it's so written. And then, yeah, phong phong ah, but here they're written. Yeah, so the bok in the phong phong ah, what does the bok mean? Is that the in part? That's the white part. Yeah, white. Yeah, okay. So phong phong ah. Okay. And then, similar to our put put, the kissing one, you've got kong pei, kong pei, but this may be more of a... is it more of a serial or a sequence of actions, or "kept doing it," more aspectual? Yeah, correct. Yeah, so it's slightly different in meaning. So perhaps when we have these differences in usage, we also want differences in the way that we gloss. When you're storytelling in South West Asia, they're like going, going, going, going, going. Something is like something like that. I know. Doing things in good. Yeah, so there would definitely be. And then in Anal Naga, we see something like the main verb, and then the verb, and then the copied form called the reduplicated form. So what is that? Do you recognize this? It's from, yeah. Like what the function is or? Yeah, it seemed like. Maybe, yeah, so it's an expectation, right? Do not keep doing it, or do not keep speaking. I really don't. We're turning to you and it's like, I don't know anything. I'm not my best friend. It's something... I gave that talk about verb, not verb, meaning intensive verb, and that looks like that. Oh, with the "not" on the... but then this is a copy of our description. I mean, it's similar, but it's not the same. Yeah. Just intensification. But it's. That's what you said, maybe. Yeah, verb, not verb, that looks like it.
So I feel like this should be fairly easy to do by just gathering all our examples and figuring out exactly how we want to do this, and then giving a guide to new researchers who are just starting out on IGT. I want to talk about that briefly when we get to the end of the presentation. Then stem alternation. Similarly, there are a variety of different ways in which we're indicating those. We have the Roman numeral, old school, here; we have B; and then we also have a "don't mark the A" approach, or have a set that just doesn't have any marking on it. Maybe those that have no variant shown may not mark that. It's really hard because sometimes a lot of them don't have it, and sometimes they might have it but you just don't have it in your data set. You don't know whether it's a glottal stop or... they probably have it, but you don't even want to do it. It may be a tone or, yeah, people are not able to do it. So what do you do with those, right? We have Arabic numerals for the Mizo, and then we do have the three here, which I believe is for those where there is no variant, so that shows with the three. But sometimes there are also three forms. Yeah, just give them one, just give them something. There are some that have three forms, right? With the aspirate or something like that. Okay, so that should be, again, fairly easy for us to figure out what to do with. And then when we say we don't know, can we mark that, or what should we do with those, or do we just mark them as that? We just have a default if not marked. Okay. Otherwise, I think, if you have the data set that you have and you don't know, you don't know, and if you're being made to feel that every item is defective, that can be pretty depressing, actually. Don't put question marks on everything. But then we just have to agree that unmarked doesn't mean it's actually unmarked; it means we don't know. Yeah. So, yeah. And I think that's quite a good default. Yeah.
Because at least it wouldn't double worth it to be able to speak about myself and one language myself. Have a chance to speak and let your language be the default for everybody else. You set the rules for all. So, this is pretty standard. We use a dash for the affix. We use an equals sign for an enclitic. But we do have some things that are sort of hanging out there. Oh, here's something interesting, though, which is an orthogonal point: we have a difference of analysis here. Pavel and I analyze these very close constructions in Lamkang and Anal Naga differently, where he has the NEG as a bound morpheme and I've analyzed it as an auxiliary. So I think there's some room there for discussion: where are the things that have a status as auxiliaries? How do we decide that they are? What are the kinds of criteria? So that we have this kind of standardized idea. I can't use the word standardized. It's really more throwing out the data and saying, what are your arguments for making this an auxiliary? How come it's different from what I've got? What were the changes that caused that? Because they're so close geographically and, yeah. This is making sure that you do it in a transparent way so that people have a good chance of recovering your intentions and your decision making. Yes, yeah. And for me, the thing was that after the root you can get inflection, you can get plural marking. So it seemed like a logical break, because the negator and the root were each behaving as if they were able to take inflection. And having a morphological break there made the description much easier than having inflection and derivation interwoven. But then, again, in what I would call legacy materials as well as some more modern things, we have these things that are hanging out and have no dedicated structural assignment.
And so, there are these particle-type things. Maybe they're meant to be particles, maybe they're not, but they certainly look like it from the IGT, and I have an inquiring mind, I would like to know: what are they? Are they or aren't they? There's one here, and there's this one here that Ken Van Bik actually calls the verbal particle, and these nominal particles. And that would be fine; we just want to know what qualifies. How do you qualify for that? That means whatever Ken thinks, what Matisoff thinks a particle is, a particle is. So we're going to find out what Matisoff thinks a particle is and get an idea. Because it makes a huge difference computationally whether there's white space or no white space. To us, it may be, well, I understand that this is a particle, but if I hand it to a machine, it wants to know. If there's a dash, it does something; if there's an equals sign, it does something different; and if there's white space, it does something different again. So we want to know what that value is and how we're supposed to be treating it when comparing with another, related language. So, okay, person indexation; we already saw this. This is my person indexation. The way I do it is totally controlled by FLEx, because in FLEx it says S colon third. So it uses subject and object. It doesn't use agent and patient. And a lot of us don't use FLEx, or we enter it manually in FLEx. I use the automatic option. I call something inflectional; as soon as I call it that, it pulls up a menu, I can select from the different combinations, and it gives me a templatic assignment for that morpheme. So I'm controlled by that. Others are not. And so, if we want to be able to compare across our corpora, we would want to discuss what the best way to do this is, and then we all just do it. I think this is a very simple fix.
It's just a discussion fix, because all the languages are basically doing the indexation in very similar ways. So, FLEx forces you to think about things in terms of subject and object. If you want to. Possible parameters. If you want to. If I have those, you can select from that. Oh, okay. But you can think outside of the predefined. Yeah, you don't have to, but you put. Okay, good. I find one of them, I'll just go to the end later. So, here in Daai, it's subject agreement, first singular, I guess. Is that what AGR is? I think that was what it means. Here's Thadou, where we have an equals sign, the equals sign meaning that the ergative is a clitic. And then here is kind of an odd thing, in this part of it, on the prefix part of it: there is even an equals sign here, I guess saying that this is not a prefix but a clitic, or maybe... I don't know if that was put there on purpose. Okay. So that was one part. Finally, I think, subordinate clauses. In subordinate clauses, again, we have a large range, and I think we can simplify and really do a better job here. We have some that are these particle types. We have some where we're actually defining the semantics of it, saying it means "while." And this is again the Mizo. So let's go to a more modern one here. In the Lai one, from the handout again, we've got a description of exactly what it means, which is "after"; it doesn't tell you the category. But the Daai Chin has this interesting thing of having "subordinate," colon, and then the meaning, so it has the super category and then the actual definition. That may be a useful way of comparing. If I just wanted to compare between Monsang, Anal, and Lamkang, and everything was marked as subordinator, that would make it much easier for me to pull them all out, rather than just looking for where it said "while." In Hyow, this was really... I just wanted to stop and talk about this example.
I know this is an example that's not exactly from IGT, but this particular type of thing, like the bracketing: we think in these terms while we're creating our IGT, but we can't save it anywhere. I actually went to SIL and talked to folks there; I was advocating for including in FLEx several additional lines of annotation where I could go in and put in things like thematic role. I'd like to put in things like the discourse value of a particular NP: is it a previous mention? Is it an old mention or a new mention? Because all of that has implications for the kind of marking, whether it's going to be zero or what have you. But again, our software currently restricts us from capturing the kind of information we're processing while we're doing our IGT. So I think there's huge room for improvement there. But that's what you've done here: you've gone and put in brackets, which is very useful for somebody trying to understand, and also very useful for a machine that's going in and trying to look at what noun phrase structure is like for South Central, and so on. Right now you can't really do that. You'd have to give it a lot of extra information, because we have so many NPs that are missing as well. So being able to put that in our glossing would be great. Here, again, this is actually just giving us the semantics rather than the category. Okay, so that was for that. And this is going to a completely different topic, which is derivational morphology. I haven't dealt with this enough, I think. But we all know that there's the zone in our verb, right after the stem and before the inflection, where a bunch of different derivational morphemes can occur. And they look similar. I don't know, David, did you ever catalog a number of those derivational morphemes? I've seen a list of them somewhere.
And so it would be nice to maybe do a huge old spreadsheet with some of these meanings and the actual morphemes, see which languages they show up in, and ask whether we could, in some ways, standardize some of the terms that we use to talk about those derivational morphemes. Maybe we can, maybe we can't. I don't think we've tried it, and it's possible. So I think I've discussed white space somewhat. The way we write things is really important. And I think, Zachary, you've done a really great job in your thesis, where the top line is, I think, the writing system, and the second line is the IPA with the dashes in there. So we can ignore the orthographic line when we're doing our computational work; we can go to the second line. It would help if we could agree to do something of that sort, or if we could be cognizant of the fact that working off of orthography is very confusing for somebody who's trying to use the spaces between the words to define a unit and to tell a Python script "this is a unit" when actually that's not the unit but this is the unit. The writing systems that we use all vary in the way that things are lumped together or separated. So our IGT should preferably go through something other than those, or should include something other than that. This one seems to be a combination, because you've got an equals sign here, but then this guy here is sitting by itself. All right, I won't do anything with that; I just have one example of that. I'm not sure how many of you have examples of this sort, but for serializations you have a dash and, sure. Okay, so I'm going to wrap up now, because I wanted to point out a couple of things while finishing up. Standardizing has been tried before. All that we've looked at right now really has to do with assigning structure to the morphology: dash, equals sign, space.
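To make the point about dash, equals sign, and space concrete, here is a minimal sketch of how a script might read those three boundary types from a segmented line. It is an illustration only, not a tool mentioned in the talk: the names `segment_word` and `segment_line`, the labels "space", "affix", and "clitic", and the example forms are all hypothetical.

```python
import re
from typing import List, Tuple

# What each boundary symbol is assumed to mean in the segmented line.
BOUNDARY_LABELS = {"-": "affix", "=": "clitic"}


def segment_word(word: str) -> List[Tuple[str, str]]:
    """Split one whitespace-delimited word into (morph, boundary_before) pairs.

    The first morph in a word is labeled "space" because it follows
    white space (or the start of the line).
    """
    pieces = re.split(r"([-=])", word)  # keep the separators in the result
    result: List[Tuple[str, str]] = []
    boundary = "space"
    for piece in pieces:
        if piece in BOUNDARY_LABELS:
            boundary = BOUNDARY_LABELS[piece]
        elif piece:  # skip empty strings produced by adjacent separators
            result.append((piece, boundary))
            boundary = "space"
    return result


def segment_line(line: str) -> List[List[Tuple[str, str]]]:
    """Segment every word of a morpheme line."""
    return [segment_word(w) for w in line.split()]


if __name__ == "__main__":
    # Invented forms, not data from any language discussed here.
    for word in segment_line("ka-pa=in tuk"):
        print(word)
```

A script like this treats "pa=in" and "pa in" as different structures, which is why an unexplained space around a particle matters so much once corpora are compared automatically.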
And then we looked a little bit at glossing conventions, but there has been a whole movement for glossing conventions. There is the General Ontology for Linguistic Description, GOLD, which was funded by the National Science Foundation and is now defunct. The last update was done in 2011, when there was a kind of forum where you could go in and say, oh, GOLD community, you have a list of all of the different ways that people can gloss words, but I found a new morpheme and you don't have that. So this person, for example, Christian, says: I found something that I would like to call the pertinentive case for Etruscan, here are the uses of it, and please add it to your list of what's available. So what I'm advocating for is a GOLD, but a GOLD for our community. What is the GOLD ontology for South Central description? We can have variation, but we would have a kind of code book for people who wanted to do their glossing and who were just starting out. Those of us who've been glossing for a while may be reluctant: I do it this way, I'm going to keep doing it this way. But what I'd like to keep in mind is that, well, here's what I'd like to keep in mind. I'll come back to the other things in just a second.
So what I'd like to keep in mind is this third point right here, responsive methodology. As those of you who've gone to NEILS know, there are so many native-speaking linguists who now want to work on their languages. In India, the way that people have been doing grammars has been to use templates that come out of the Central Institute of Indian Languages. It would be really nice to encourage them to do more interlinear gloss texts, so that they're challenged to get at the genius of the languages they're working on rather than copying descriptions of languages that already exist and saying, well, I found the seven cases in the language, or whatever it is that they might do. If they're given a template for how to do that, if they're given a code book for how to do that, this would make it easier for them, it would make it more accessible, and it would be a kind of typological instruction as well. You may find verb serialization, you may find verb stem alternation, you may find these different kinds of grammatical categories that we've discussed, and IGT with a guide to it would make that possible. Also, more and more, the technologies for cross-corpus comparison and intra-corpus comparison are becoming available. If we can be more systematic in the way we do this, then I could compare with Khumi and Mizo and Lai, and for the next language that I work on I could do a much faster analysis, because I could see how, say, Hyow and Khumi and Lamkang all work with respect to pre-verbal directionals. And then I could write a script that would tell the process that I'm going through: hey, here's a corpus of unanalyzed texts, here are all the verbs, everything before the stem is fair game for looking for directionals, so go look for it. If we've analyzed things pretty much the same way, then that makes writing that script much easier.
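The script imagined here can be sketched in a few lines once the corpus is segmented consistently. The code below is a hedged illustration, not the speaker's script: the `Word` record, its `stem_index` field, and the function `pre_stem_counts` are hypothetical, and the sketch assumes that each verb has already been segmented and that the position of its stem is known.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Word(NamedTuple):
    """One segmented verb (hypothetical record layout)."""
    morphs: List[str]  # segmented forms, in surface order
    stem_index: int    # position of the verb stem within `morphs`


def pre_stem_counts(words: Iterable[Word]) -> Counter:
    """Count every morph that occurs before the stem.

    Frequent pre-stem morphs are only candidates for directionals;
    a linguist still has to inspect them.
    """
    counts: Counter = Counter()
    for word in words:
        counts.update(word.morphs[: word.stem_index])
    return counts


if __name__ == "__main__":
    # Invented placeholder forms, not data from any language.
    corpus = [
        Word(["x", "y", "stem"], 2),
        Word(["y", "stem", "z"], 1),
    ]
    for morph, n in pre_stem_counts(corpus).most_common():
        print(morph, n)
```

The sketch only works if every corpus marks the stem and its boundaries in the same way, which is the argument for a shared code book.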
So this is what we're looking at now: my Meitei IGT and my Lamkang IGT are way off from each other because I did them at very different stages. They've been a great playground for my colleagues to look at and say, if we could do some kind of standardizing between them, it would make for much faster use. So let's look at these quotes, and then we will finish up with that. This is from Benner: after language transcription and annotation, the data formats and annotation standards vary widely, and this hinders data mining and hypothesis testing. So, data formats: we're now all going into XML, and that's great. But we can go even further in making our XML look more like each other's XML, because that would help this new process of data mining that's happening. I think it's not a pie-in-the-sky dream but something that we could really apply, to do more historical work and all the kinds of work that we want to do, faster. Okay, thank you. Thank you very much.