Okay, so today I would like to present a method that I started working on for my dissertation and have continued to work on since then, on and off, a lot of it with my colleague Kevin Ryan, who is at Harvard in linguistics, and more recently with a colleague, Taylor Arnold, who is at my current institution, the University of Richmond, in the math department. The method was developed specifically to guide reconstruction of the Rig Veda, but I would like to think that it has some broader applications. So I will be, as always, interested in your thoughts on this particular method, but I'm at least as interested in your ideas about where we might apply it elsewhere. Should I stand more this way? Okay.
All right, so just a brief intro to the Rig Veda. This is the oldest text of the Indic branch of Indo-European, also called Indo-Aryan. The language is called Vedic and, unlike Classical Sanskrit, it's a living language. We know that because we observe changes in every part of the grammar between the oldest Vedic text, which is the Rig Veda, and then through Middle Vedic and Younger Vedic, until it gets fixed as Classical Sanskrit. What the Rig Veda contains is a collection of praise poems, mostly praising the deities of the Vedic pantheon and trying to motivate them to attend the ritual where the poem is being performed and to benefit the ritualists. The entire text is in poetic meter. All of the meters have in common that they regulate syllable count and also syllable weight distribution, and some of them also have caesurae, that is to say, points in the verse line where a word or phrase break is required.
And it's important to know about Vedic that there's a two-way syllable weight distinction between light on the one hand and heavy on the other. If a syllable ends in a short vowel, it counts as light, and all other syllables are heavy: long-vowel syllables, short vowel plus consonant rhymes, long vowel plus consonant rhymes, and so forth.
So here's a rough representation of the eight-syllable verse type. I should start with the breve. I'm using that to show positions that are usually implemented with light syllables and are heavy only about a third of the time. Then the X just shows relatively free positions, which are heavy anywhere from one third to two thirds of the time, and the macron, or the long mark, shows you preferentially heavy positions. You can see that there's essentially an iambic rhythm to the verse that goes da-dun, da-dun, da-dun, da-dun, and there are two additional and well-known principles at work here. I think this is on the next slide. Yes. One is final strictness. This applies to the verse line, and it essentially says that the later in the line, the more strictly syllable weight is regulated. I guess there are good typological parallels for this as well. The other one is final indifference. This applies just to the last position, the last syllable of the verse, and that's relatively indifferent to weight. So even though, analyzing this as an underlyingly iambic meter, you would expect to find heavy syllables in the last position, final indifference permits you to implement that position with either a light or a heavy syllable.
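The weight rule just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is already divided into syllables, it uses a romanized transliteration in which a, i, u, ṛ, ḷ are the short vowels, and the names `syllable_weight` and `weight_template` are hypothetical, not part of the method presented in the talk.

```python
import unicodedata

# Assumption: in this transliteration the short vowels are a, i, u, ṛ, ḷ.
# Everything else at the end of a syllable (long vowels, e, o, diphthongs,
# or any consonant) makes the syllable heavy.
SHORT_VOWELS = {"a", "i", "u", "ṛ", "ḷ"}
DIPHTHONGS = ("ai", "au")

def syllable_weight(syllable: str) -> str:
    """Return 'L' for a syllable ending in a short vowel, otherwise 'H'."""
    s = unicodedata.normalize("NFC", syllable.lower())
    if not s:
        raise ValueError("empty syllable")
    if s.endswith(DIPHTHONGS):
        return "H"  # ai and au end in i/u on paper but count as long
    return "L" if s[-1] in SHORT_VOWELS else "H"

def weight_template(syllables):
    """Weight template of a word given as a list of syllables."""
    return "".join(syllable_weight(s) for s in syllables)

print(weight_template(["pu", "ruṣ", "ṭu", "ta"]))  # LHLL
```

Note that this scores a word in isolation. In running verse the weight of a word-final syllable depends on what follows, so a real implementation would compute weights after resyllabification across word boundaries.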
And the 11-syllable verse line has a caesura either after the fourth position or after the fifth. You can see there's an iambic opening, like da-dun, da-dun, da-dun, and then da-dun, da-dun, da-dun, da-dun, or you could go da-dun, da-dun, da-dun, da-dun, da-dun. I have to stress the preferentially heavy positions to be able to hear them, because I'm an English speaker, I guess. The 12 is really just like the 11, except that you descriptively insert a preferentially light position in the penultimate position of the verse. So it opens the same way, da-dun, da-dun, and then you have the same sort of da-dun, and then instead of having da-dun, da-dun, da-dun, you have da-dun, da-dun, da-dun. Okay. So I'm almost done with the description of the meter, so bear with me.
You basically just put these verse lines together in threes or fours usually, sometimes in other combinations, to make up a stanza. You can show that there's more structure than just the line. It's not a purely stichic verse type: there's evidence for a couplet-like structure and for stanza structure and so forth too. But for us it's really just the verse line that's important. Most of the Rig Veda, 83%, is composed in verse lines of the three types that I just described, and so for the rest of this talk I'm going to exclude the rest as unusual meters, to be able to do some comparisons that I'll talk about in just a second. You can tally the number of verse lines, which Sanskritists usually call pādas, and then calculate the corpus size by multiplying the number of pādas by the number of syllables in the pāda, to get a sense of how big the corpus is.
Okay. So the text, we think, was composed maybe around 1200 BCE, or Professor Hill suggested the 12th century; I don't think it really matters for our purposes. We don't really know exactly.
It must have been composed over a relatively long period of time, because we do see differences between the language of the older parts of the text and the younger parts of the text. And then it was transmitted orally for many centuries in an incredibly accurate way, thanks to these feats of memory, and those were of course motivated by the importance given to correct recitation in this particular culture or subculture. But nevertheless there were some changes made to the text, and some of those involved the replacement of linguistically older, more archaic forms with younger ones, and sometimes the younger ones are the ones that are familiar to us from Classical Sanskrit. So you basically have some old Vedic stuff getting replaced with some younger Classical Sanskrit stuff.
Okay. So one of the examples that is usually given, because it's almost completely exceptionless, involves a sequence in the transmitted text of a consonant followed by a glide, either a y or a w, which is transliterated with a v, followed by a vowel that has a grave accent mark on it. That grave originally represents a falling pitch. And this is also essentially the only place where you find a grave accent in the text. It's a pitch accent language. As a rule, with some systematic exceptions, lexical words have one syllable that receives a high pitch, and the rise to that high and the fall from that high back down are phonetic. So this is a little exception to that. And I should also say that the placement of the accent is almost, if not purely, morphologically determined.
Okay. So wherever you find this sequence, that is to say consonant, glide, and then vowel with falling pitch, the verse line is one syllable short. So paying attention to the meter motivates the reconstruction of an extra syllable. Just as an example, I took the word for sun, which is transmitted as svàr.
And I just took some eight-syllable verses with the phrase svàr dṛśé, which means 'to see the sun'. You can see that pratyáṅ víśvaṃ svàr dṛśé is seven syllables. But if you reconstruct súar dṛśé, then you get eight syllables. I think there are hundreds, if not thousands, of examples of this kind of sequence. It's almost perfectly regular. It's a super clear case of paying attention to the meter and then restoring a slightly older phonological sequence. The change that happened, obviously, was the gliding of the high vowel. And this seems to have introduced a new accent contrast into the language, because the pitch fall was then apparently phonologized as falling pitch. So you had súar, at a stage where there was just one high-pitched accented syllable. Then you say svàr, and there's falling pitch still on what's now the only syllable, and that gets phonologized as the accent that we write or transliterate with the grave. I hope that was clear.
So that was an example where we know, and have known for a long time, that we need to reconstruct an extra syllable. And now comes an example where we need to reconstruct something that just involves syllable weight, not count. This is another classic textbook example. There's a root mṛḍ, which means something like 'be compassionate, take pity'. Wherever the root is followed by a vowel, the first syllable as spelled consists of m and a syllabic r. Paying attention to the meter of the Rig Veda shows that the first syllable must have been heavy, right? Normally mṛ would be a light syllable, since the syllabic r counts as a short vowel. But it appears to have had either a long vowel or a vowel closed by a consonant. And you can tell this again by looking at forms such as the imperative mṛḍaya, which means 'take pity'.
And the poets regularly place this at the end of eight-syllable verses, where we actually expect dun-da-dun. So we expect something with a heavy first syllable, right? And so this is, again, alerting us to the need for some reconstruction, and then we'll use internal reconstruction and the comparative method, as always, to insert the right form back into the text. Here the fact that the d or l is retroflex, plus the Iranian evidence, where you find Avestan forms like mərəžd- with a consonant between the syllabic r and the d, is going to motivate us to reconstruct something between mṛžḍ- and mṝḍ- for the Rig Veda, whatever exactly that was. Certainly it still counted as a heavy syllable then.
Okay. So those are pretty exceptionless cases. And then there are cases where you have very good evidence for variation, and it seems to have been up to the poet to use whichever form he wanted to. I don't think that this variation has been very closely studied, so there may be more than just metrical factors going on here, but it looks pretty metrically determined to me. A well-known case is the dative-ablative plural suffix -bhyas, which can be realized as either -bhyas or -bhias wherever it comes after a heavy syllable. That's just up to you as the poet. And then there's the older form of the genitive plural ending, disyllabic -aam or something like it, that's very hard to tell, and the younger form -ām. It's only transmitted as -ām, but we can see that essentially the poet got to choose whether he wanted to use monosyllabic -ām or the older disyllabic -aam when versifying. So for example, if you were to close an eight-syllable verse or a 12-syllable verse, and remember those have a diiambic cadence, they go ba-dum ba-dum, then you will say jánānaam to mean 'of the people'.
And if you are composing an 11-syllable verse, then you're going to say jánānām, because it gives you the right rhythm. I guess many Indo-Europeanists would reconstruct this genitive ending as *-oHom, or something like that. And I think pretty much everybody agrees that it survives into both Vedic and Avestan as disyllabic -aam. And then there's the younger form, with the very unsurprising-looking contraction. Another reason to assume that it was originally disyllabic is the accent of the Greek genitive ending -ōn, which is circumflex and should probably be derived from a two-vowel sequence. So I think that's the standard line on that.
Okay, so everyone agrees on the examples that I've shown you so far, I think. People have studied this very closely, including a lot of people in the mid and late 19th century and early 20th century, Hermann Oldenberg and E. Vernon Arnold being two of the most prominent. And many of those restorations were then adopted into a metrically restored, that is to say reconstructed, text that was published in 1994. So is there any more to do? I would say yes, I think we can do more and also should want to do more. We are now in a position, I think, to say relatively decisive things about forms that are much less frequently attested, or that occur in parts of the verse that are not as strictly regulated, or both. So basically, thanks to some advances in statistics, we can be a little bit more sure about things than 19th and early 20th century scholars, who just didn't have the same sort of tools that we do now.
So here's an example, a verb form, īśīya, that means something like 'I could be lord, I could be master'. It has the so-called optative suffix, so that's an irrealis-type form, followed by the ending for the first person singular in the middle voice, which is just -a here. And the form is only attested three times in the Rig Veda.
And here you see, this is the first eight-syllable line, second, third, fourth, and the form occurs at the end of an eight-syllable line, where you expect ī-śi-ya with a light second syllable, not ī-śī-ya. But if you look at the other two attestations, they're in the first half of an eight-syllable verse, where the meter is much less strictly regulated. People would not normally feel these to contribute much, or any, evidence one way or the other. So what do we do, right? We say, well, it is in the cadence that one time, but how often do you expect that sort of thing to happen? It's unclear. And so I think the right thing to do, if you aren't in a position to do some careful statistics, is just to say: well, I'm not sure. It might just be one of the departures from the meter. I don't want to make too much of it. And that's what Oldenberg did. Others felt freer and just did whatever they wanted, but Oldenberg was a very sober worker.
So, okay, what we're going to do now is first note that the localization of a word is partly phonologically determined, just by the nature of metrical composition. And then we're going to compare the way that poets localize a particular word with the way that they localize all of the other words that have the same shape. When I say shape, I mean the phonological properties that matter for the meter: number of syllables and syllable weight distribution. And note that we also have to pay attention to the onset of the word and the rhyme at the end of the word, just because of the way resyllabification works across word boundaries. So if you look at this, this is aham īśīya, 'I would be lord'. But it's syllabified a-ha-mī-śī-ya. So that's an example of rightward resyllabification at the end of a word.
And then here's an example of leftward resyllabification, where mahi plus priya gives you ma-hip-ri-ya, with hip then being closed and counting as a heavy syllable. So these are the things we have to pay attention to: syllable count, the syllable weight template of the word, if you will, and then some things about its edges.
And here's the way that I'll be representing this. This gives you the weight template of the word, so light, heavy, light, and then the final syllable, whose weight typically depends on what follows, and that's why it's an X here. And the word starts in a single consonant and ends in a short vowel. Frequent words that belong to that shape class are puruṣṭuta 'much praised', parāvati 'in the distance', dadhātana 'put!', the imperative, and that sort of thing. As you would probably have guessed, where the poets like to put this sort of shape is verse-finally in eight-syllable verse, where it gives a nice iambic rhythm: puruṣṭuta, parāvati, and so forth.
So what we'll do is look at all of them, and then we will express the localization pattern of that class as a vector. This just means that there are six that they put in the first position of an eight-syllable verse, so the verse would start with puruṣṭuta. You see that once? You get a one in the first position. You see that five more times? You get a six. Zero times starting in the second position, one starting in the third, zero times starting in the fourth, and then starting in the fifth, which is the latest you can put that shape in the verse, puruṣṭuta, five, six, seven, eight, right? That's where they've put almost all of them when they're composing an eight-syllable verse.
Okay, so then in 11-syllable verse, for this particular shape class, we see that they localize most of them at the beginning of the verse, seventeen of, whatever that is, twenty-six. The post-caesura position, starting in the fifth position, is another spot.
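The tallying just described can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name `localization_vector` is hypothetical, and the token list is toy data. The counts for start positions 1 to 4 follow the numbers mentioned above (6, 0, 1, 0), while the count of 40 for position 5 is a made-up placeholder, since the talk only says "almost all".

```python
from collections import Counter

def localization_vector(start_positions, line_length, word_length):
    """Count how many tokens of a shape class start at each possible position.

    A word of `word_length` syllables can start no later than position
    line_length - word_length + 1 (positions are 1-based).
    """
    last_start = line_length - word_length + 1
    counts = Counter(start_positions)
    bad = [p for p in counts if not 1 <= p <= last_start]
    if bad:
        raise ValueError(f"start positions out of range: {sorted(bad)}")
    return [counts[p] for p in range(1, last_start + 1)]

# Toy data for a four-syllable shape class in eight-syllable verse.
# The 40 tokens at position 5 are a hypothetical placeholder.
starts_8 = [1] * 6 + [3] + [5] * 40
print(localization_vector(starts_8, line_length=8, word_length=4))  # [6, 0, 1, 0, 40]
```

The same function would be called once per verse type (8, 11, 12), giving one vector for each.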
They'll put them there too, and then 12-syllable verse looks like eight-syllable verse: they just put them line-finally, because the cadence there is likewise da-dun da-dun. So, okay. And then we just put the three vectors together into one long vector, and so now we have captured, as a mathematical object, what you might think of as the metrical fingerprint of this shape class. So now we're in a position to compare individual items to the entire class, right? And that's what we're going to do. Thank you.
I don't want to suggest that formerly people were doing bad work in this area and now finally I'm doing really good work in this area. That's not true. But there are some advantages that I think we can point to. One is that we're just including a lot more information now. We're not just looking at the cadence of the verse, we're looking at the entire evidence, and even though the earlier parts of the verse are not as strictly regulated, they are regulated, and so these are informative things that we're adding to the picture. We're also taking account of the relative frequency of a class in the three verse line types. So, for example, the type that we just looked at, the poets like to use that more in 8- and 12-syllable verse than they do in 11-syllable verse. You can come up with an expected frequency in each verse type, just generated by the relative size of those three subcorpora, and you see that they're either avoiding them in 11 or preferring them in 8 and 12, or both, and that gets captured here as well. And we can work with word shapes that just aren't fit for the cadence, so they are only localized in less regulated parts of the verse. And if we do the math correctly, then we're going to be treating the infrequently attested items exactly, and we can even say things about items that are attested three times, or two, or one time. Okay.
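Both steps mentioned here, joining the three vectors into one fingerprint and deriving expected frequencies from the relative sizes of the three subcorpora, can be sketched briefly. The function names and all numbers in the usage lines are hypothetical; in particular, the subcorpus sizes are placeholders, not figures for the Rig Veda.

```python
def metrical_fingerprint(v8, v11, v12):
    """Concatenate the per-verse-type localization vectors into one long vector."""
    return list(v8) + list(v11) + list(v12)

def expected_counts(total_tokens, subcorpus_sizes):
    """Expected tokens per verse type if a class were spread in proportion
    to subcorpus size (for example, size measured in syllables)."""
    total_size = sum(subcorpus_sizes.values())
    return {name: total_tokens * size / total_size
            for name, size in subcorpus_sizes.items()}

# Hypothetical toy vectors and sizes, for illustration only.
fingerprint = metrical_fingerprint([6, 0, 1, 0, 40], [17, 0, 0, 0, 9, 0, 0, 0], [2, 0, 0, 0, 0, 0, 0, 0, 30])
expected = expected_counts(100, {"8": 1, "11": 2, "12": 1})
print(len(fingerprint), expected)
```

Comparing the observed number of tokens in each verse type with these expectations shows whether a class is preferred or avoided in that verse type.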
So we could say something like: what's the probability that dadhātana, expressed as a vector, belongs to the class of light-heavy-light-X items that are shaped the same way, minus dadhātana? We'll get a probability value, and those will all be very small, so we'll just take the log of them so that they're easier to work with, and you don't have things like 0.01345 or something like that. And obviously when you see something like negative 32.5, that doesn't mean anything until you put it relative to the other log probability values that you're coming up with. Just to give you a sense of this, the log probability values for the individual forms of the Rig Veda range from negative 918, that's the least probable to belong to its class, to negative 30, which is the most probable to belong to its class. And for each class we can also come up with an average. The average for this class is negative 31.5, so very similar to the negative 32.5 that we have for dadhātana.
Now, because of differences in shape and so forth, the class averages are very different from each other. They range from roughly negative 250 to negative 30, and that's because of what I refer to as tight and loose classes. The one that we just looked at is tight, in the sense that there are only a few places where the poets can fit them, or do fit them, into the verse. Something like CVC-shaped monosyllables, that's with a short vowel, they put those in almost any position in the verse, and so it's in these loose classes that you really get to see the other things that are determining word order in the Rig Veda, like syntax.
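One way to make the "probability that an item belongs to its class, minus the item" concrete is a leave-one-out multinomial log-likelihood. The talk does not spell out the probability model, so everything here is an assumption for illustration: the multinomial form, the additive smoothing constant `alpha`, and the function name `log_prob_in_class` are all hypothetical choices, and the resulting values are not on the same scale as the numbers quoted above.

```python
import math

def log_prob_in_class(item, class_counts, alpha=0.5):
    """Log-probability of an item's localization vector under its class.

    Assumed model: the class vector minus the item's own tokens defines a
    multinomial over positions, with additive smoothing `alpha` so that
    unseen positions do not get probability zero.
    """
    rest = [c - i for c, i in zip(class_counts, item)]
    if any(r < 0 for r in rest):
        raise ValueError("item counts exceed class counts")
    total = sum(rest) + alpha * len(rest)
    # Log multinomial coefficient for the item's tokens.
    n = sum(item)
    logp = math.lgamma(n + 1) - sum(math.lgamma(i + 1) for i in item)
    # Contribution of each token at its position.
    for r, i in zip(rest, item):
        logp += i * math.log((r + alpha) / total)
    return logp

# Toy example: one token in the class's favorite position is more probable
# than one token in a rarely used position.
typical = log_prob_in_class([1, 0], [11, 1])
atypical = log_prob_in_class([0, 1], [10, 2])
print(typical > atypical)  # True
```

As in the talk, a single value is only meaningful relative to others, for instance relative to the average over all members of the same class.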
So, okay, all right, here I'm just reminding you of īśīya, and notice also that rāsīya, 'I would give over', is another optative like this, where you would expect rāsiya. This will be our very brief case study. We'll look at all of the optative forms that end in -īya, these first person singular middle forms, and you can see there are just 15 of them, so 15 tokens in one, two, three, four, five, six, seven types. So this is not a well-attested class.
For each form we're going to make two comparisons. We're going to compare the form with its shapemates, and then we're going to compare the form with its putative shapemates. So, for example, we're going to compare īśīya with other things that are shaped the way īśīya is spelled, and then we're going to compare it with everything that's shaped like īśiya, the form that we're considering reconstructing or restoring. And then we're going to ask: is it more probable that the form belongs to its apparent class, or is it more probable that it actually belongs to the class of things that are shaped like īśiya? Leaving bhakṣīya aside for a second, we do find something that looks like it might be a distribution here. For īśīya and rāsīya, those two seem to have -iya with a short i, that is to say, something to reconstruct.
With the other ones you seem to have a long ī, which is the Classical Sanskrit form as well, and the one that's transmitted in the text. It also seems to be the case that you get -iya, the reconstructed form, after a heavy syllable, and -īya after a light one. That doesn't look crazy in a language that has a fair amount of morphophonology that promotes syllable weight alternation, and in an environment, iambic verse, that also promotes syllable weight alternation. By this distribution we would expect bhakṣiya with a short i, because bhak is a heavy syllable, but we find better evidence for bhakṣīya, with the long ī. And so this is as far as this takes us. I would say the method suggests that we take the reconstruction of -iya with a short i, at least as a variant, pretty seriously, and then we just go back to doing what we always do, which is internal reconstruction and external reconstruction.
Here's a scenario that seems plausible to me. I think we know what the etymology of the sequence is, and so I guess that -iya with a short i may have been the regular outcome of that. That's also what Brugmann thought, but he didn't know about the two laryngeals. And then you get -iya changed to -īya by analogy. There are various ways to do this, but there's four-part analogy, I guess you would say. It's something like: -eta, and by the way this e is always a long vowel, -eta is to -eya as -īta is to X, and for X you would solve -īya. You could also do it differently, if you like. So that seems like a plausible source of the younger form, and then what we would have is another poet's-choice situation, where you have an older -iya that's still around, which the poets opt to use especially after heavy syllables, because they're composing in a weight-alternating meter, and then you also have the younger form that we
know from Classical Sanskrit as -īya. So thank you for your attention, that's it in short. And just to remind you, I am extremely interested if you can think of other places where we could try this out. Mainly, I think, we just need some sort of text whose organization is fairly phonological, obviously not completely phonological, and where we need to figure things out about the phonology. Thank you.
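The two-way comparison from the case study, a form against its transmitted shapemates versus the same form against its putative shapemates, can be sketched as a difference of log-probabilities. This is again only an illustration under assumptions: the smoothed multinomial model, the `alpha` value, the function names, and all counts are hypothetical and not taken from the talk. It assumes both shape classes have the same syllable count, so the item's vector is the same in both comparisons.

```python
import math

def log_prob(item, reference, alpha=0.5):
    """Smoothed multinomial log-probability of item tokens under reference counts.

    The multinomial coefficient is omitted because it depends only on the
    item and cancels when two classes are compared for the same item.
    """
    total = sum(reference) + alpha * len(reference)
    return sum(i * math.log((r + alpha) / total) for i, r in zip(item, reference))

def evidence_for_reconstruction(item, transmitted_class, reconstructed_class, alpha=0.5):
    """Positive values favor the reconstructed shape, negative the transmitted one.

    The item is removed from the transmitted class first, since it is counted
    there as spelled; it is not a member of the reconstructed class.
    """
    transmitted_rest = [c - i for c, i in zip(transmitted_class, item)]
    return (log_prob(item, reconstructed_class, alpha)
            - log_prob(item, transmitted_rest, alpha))

# Toy vectors over three start positions (hypothetical counts).
item = [0, 0, 1]               # one token, in the last possible position
transmitted = [10, 5, 2]       # class of the shape as spelled, item included
reconstructed = [1, 1, 20]     # class of the shape being considered
print(evidence_for_reconstruction(item, transmitted, reconstructed) > 0)  # True
```

With only a handful of tokens per form, as in the 15-token case study, a difference like this would be read as a suggestion to be weighed against internal and comparative reconstruction, not as a decision procedure.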