Yeah, so before sharing my presentation, maybe I'll tell you how I came to be working on this particular thing. For about fifteen years, most of my work was on Tibetan Dunhuang documents. I didn't like reading canonical texts, because I was suspicious about what had happened to them between when they were translated and when they were published, and for linguistic purposes I wanted to rely on genuinely old materials. But then I was looking into the morphology of certain verbs, and there just weren't enough of them in the Dunhuang materials. In particular, I was looking at the verb "to shave", and I found that in the Vinaya, on a single folio, there was the entire paradigm of the verb "to shave". So then I thought: I have been too hard on the canonical texts; they really are essential for linguistic study. But in order to get back to the Old Tibetan period, you of course need to do textual criticism, and just working on this short passage, three folios or so, has taken years and I still haven't finished, although I don't work on it every day. And I've only collated 14 Kangyurs. So I realized that collating all the Kangyurs by hand was too difficult a task, and that is why I'm discussing the application of natural language processing to Buddhist textual criticism.

To start off, I think it's useful to contrast the state of textual criticism between Europe and Asia. It's one of great inequality. A salient point: the Nestle-Aland critical edition of the Greek New Testament has appeared in 27 different editions since its initial publication in 1898, whereas the canonical literatures of Buddhism, Hinduism, Taoism, and so on have never been fully critically edited. That's both because we have larger canonical traditions in Asia and because we have fewer philologists. So I think it's true that radically more efficient methods are needed if these vast Asian literary traditions are to receive the editorial attention they deserve.

My presentation will touch on three research contexts: traditional editorial methodology, textual criticism in the digital humanities, and the study of the Tibetan Kangyur.

First, a definition: what do we mean by a critical edition? An editor faced with disagreements in the wording of different copies that purport to contain the same work, which we call witnesses, can either faithfully copy one manuscript, which gives a diplomatic edition, or pick readings from multiple manuscripts to create a new version that is believed to best reflect a lost original, which we call an eclectic edition. In the latter case, the editor should record the different variants in an apparatus. A critical edition is an eclectic edition that has such an apparatus. Sometimes you see people call things critical editions that are really diplomatic editions, or commented editions; I think this is bad practice. We should say "critical edition" only for eclectic editions with an apparatus.

Looking at traditional editorial methodology: certain outright mistakes are both obvious and unlikely to happen independently. This is the key insight of Lachmann's editorial method. Manuscripts that share these indicative errors can be placed as sisters in a family tree, which we call a stemma. Once the stemma is drawn up, the structure of the manuscript tradition is clear.
The most ancient readings present themselves as those that are inherited on independent branches of the family. And just to emphasize this point: because editors will more readily agree on what is obviously a bad reading than on what must be the author's original reading, this system, which uses the bad readings in order to lead to the good readings, is quite objective. The textual tradition is reconstructed in three steps: we describe and collate the witnesses, infer the stemma based on the distribution of indicative errors, and then at each place of disagreement choose the best reading given the stemma structure and external factors.

The ability to recover the origin point of a textual tradition rests on two preconditions: first, the tradition must have a meaningful origin point, and second, sufficient evidence pertinent to that origin point must be available. To give two examples that are kind of extreme cases: the 394 extant manuscripts of De civitate Dei reflect a single work written by St. Augustine, and in this case Lachmann's method organizes the evidence really well. The works attributed to Homer, by contrast, lack a meaningful origin point, because they have a fluid and oral origin. But nonetheless, parts of originally fluid traditions may go through a redactional zero point that is amenable to recovery: whereas it is meaningless to reconstruct Homer's Iliad, reconstructing the Iliad of its editor Aristarchus is methodologically possible, if the evidence permits. Buddhist sacred literature overall, like the Homeric or rabbinical corpora, emerged from a pool of tradition that is unamenable to reconstruction. Nonetheless, the systematic translation of Buddhist texts at the Tibetan imperial court is a reconstructable zero point. Which is to say: it should be possible, in a sense, to do a critical edition of the Kangyur.

So, let's make a critical edition of the Tibetan Kangyur. As I was saying, that's impractical given current methodology, so let's see what tools the digital humanities furnish us. Most research in computational textual criticism focuses on drawing up stemmata from a pre-prepared collection of potentially significant variants. A computer infers a stemma using one of three approaches, and I'll go through these quickly: distance-based, parsimony, and maximum likelihood.

Distance-based methods measure text-to-text disagreement while disregarding processes of change; basically, they just ask how many letters are different between two versions. The main advantage is that it's computationally easy, it's quick for the computer, but it's a pretty naive approach that doesn't have much to recommend it. Parsimony methods are what most people use: we look for the family tree that presumes the fewest number of changes. The idea is an Occam-type razor, that the simpler explanation is probably the right explanation. But as a heuristic it assumes that all changes are equally likely, and that's just manifestly not the case. So this approach falls prey to what's called long branch attraction, where coincidental innovations are mistaken as diagnostic; in the traditional terms I was using, it identifies shared innovations, but not shared diagnostic innovations. The third type is maximum likelihood methods, in which a different likelihood is assigned to different types of changes, relying either on a priori assumptions or on training data; possible stemmata are then tried out to find the one that best matches the actual data as it exists. This is a realistic approach, but it has a very high computational cost. In addition, change models built on faulty assumptions can yield worse results than parsimony, and models derived from training data (which won't be built on faulty assumptions, because they're based on judgments that philologists have made) are, on the other hand, very expensive to create. So overall it's the approach that, all things being equal, one would want to take, but it's too expensive in terms of computer time and in terms of philologist time, so it's not currently used very much.
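Just to make the simplest of these, the distance-based approach, concrete: here is a minimal sketch in Python. The three witness readings are invented for illustration, and difflib's similarity ratio stands in for a proper edit distance.

```python
from difflib import SequenceMatcher

# Toy witnesses (invented for illustration): three copies of the "same" line.
witnesses = {
    "A": "the monk shaved his head",
    "B": "the monk shaved his hair",
    "C": "the monks shaved their hair",
}

def distance(a: str, b: str) -> float:
    """Crude distance: 1 minus difflib's similarity ratio.

    This is exactly the naive move of the distance-based approach: it counts
    how much two strings differ, ignoring *how* they came to differ.
    """
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# Pairwise distance matrix; a tree-building step (e.g. neighbour joining)
# would then cluster the witnesses with the smallest distances as sisters.
for x in witnesses:
    for y in witnesses:
        if x < y:
            print(x, y, round(distance(witnesses[x], witnesses[y]), 3))
```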
So, are phylogenetic methods even worth the trouble? I think that's a good question to ask. Oftentimes there's little benefit over traditional approaches, and I'll just give one example; I don't mean to pick on him for it. Apple, in 2019, collated 12 witnesses of the Ārya-Avalokiteśvara-paripṛcchā-saptadharmaka, and he used a parsimony-based approach to draw up their stemma. But his edition merely transcribes one of the 12 versions, which is a diplomatic edition, not a critical edition, and he doesn't even report the indicative errors that the stemma relies on. So the application of computational techniques has not in fact added any value. And because the primary branches of the Kangyur stemma are already securely identified, there is no need for phylogenetic reconstruction. Computers are being used most where they are needed least, and I think that's something we should change.

My proposal is instead that we use natural language processing for the steps of collation and reconstruction, not for drawing up a stemma, at least not for the Kangyur. We should model an editor's behavior in comparing different manuscripts of the same work as a series of natural language processing steps, adding ever more abstract levels of linguistic analysis, from noting spelling variation up to understanding syntax.

There have been some efforts to computationally model manuscript collation, beginning already in the 1960s, and these have coalesced into two software products, called Juxta and CollateX. The teams behind both tools subscribe to what they call the Gothenburg model, so named because they had a meeting in Gothenburg to decide on it. In the Gothenburg model, collation is thought of as five consecutive steps: segmentation; regularization, which is the purposeful ignoring of certain differences; alignment; analysis; and visualization. The software automates the final three steps, but presupposes segmented and regularized machine-readable transcriptions of the witnesses beforehand. So you basically have to have done a diplomatic edition of every single witness, and then feed that into the computer. Computers are not yet widely used for transcription, segmentation, and regularization: to give some examples, otherwise computationally well-informed humanists working on Latin, Greek, and Tibetan still transcribe manuscripts, segment the text into words, and regularize the orthography entirely by hand.
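As a toy illustration of the three steps the software does automate, here is a sketch that starts, as the Gothenburg model requires, from token lists that have already been segmented and regularized by hand; the two witness readings are simplified invented examples.

```python
from difflib import SequenceMatcher

# Steps 1-2 (segmentation, regularization) are presupposed: we begin from
# prepared token lists, exactly as Juxta and CollateX do.
w1 = ["de", "bzhin", "gshegs", "pa", "la", "phyag", "'tshal", "lo"]
w2 = ["de", "bzhin", "gshegs", "pa", "rnams", "la", "phyag", "'tshal", "lo"]

# Step 3: alignment.
for tag, i1, i2, j1, j2 in SequenceMatcher(None, w1, w2).get_opcodes():
    # Step 4: analysis, i.e. classify each aligned span as agreement or variant.
    if tag == "equal":
        print("agree:", " ".join(w1[i1:i2]))
    else:
        # Step 5: visualization, here just a crude apparatus-style line.
        print("vary :", " ".join(w1[i1:i2]) or "-", "]", " ".join(w2[j1:j2]) or "-")
```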
So I think the model we should turn to is what I like to call industrializing philology: we should get the computers to do as much of the actual hard work as possible. And I turn for my theoretical framework here to Karl Marx, who says that the path of industrialization is analysis through the division of labor, which gradually transforms the operations of the worker into more and more mechanical ones. The Gothenburg approach of dissecting the problem of collation into smaller, more manageable sub-problems follows this course. But we must take their program further.

In order to do that, let's look at what a human editor does. When a traditional scholar prepares a collation, she reads each witness, noting its spelling, grammar, and style; because scribes acted differently in different times and places, and with different genres, editors consider all of these features. Segmentation, that is, deciding what to compare when you're collating, requires an appropriate theory of scribal practice, and regularization, deciding what to ignore while you're comparing, requires an appropriate theory of scribal caprice. I'll give one example, from medieval French poetry: scribes copying medieval French poetry freely changed the tense, mood, and person of verbs; the same scribes, when copying Latin poetry, would never do that. So it's important to keep in mind the concrete circumstances in which the scribal practice was happening. For medieval French, then, the regularization step of our workflow should ignore inflection and account only for distinct words; that's lemmatization, in NLP terms. In a sense, collation of medieval French manuscripts needs to target a different moment in the NLP pipeline than it would for, say, Latin.

Maas describes the ideal editor in the following way: the most reliable collator best understands the text, but is able to switch off this knowledge in order to work with mechanical rigor. A computer innately applies this rigor, but permits alignment at only one level of analysis, and different levels lead to irreconcilable collations. Let me dwell on this for a moment. When editing something by hand, I can decide that in this case I'll compare whole phrases, in this case I'll compare words, in this case maybe I'll compare individual letters, depending on my knowledge of the state of the field. Using collation software, you just feed in segments and the computer compares them, so you can only work at one level of analysis: I'm either ignoring inflection or I'm not ignoring inflection; I can't turn it on and off. In some ways that's good, because it means we have to be more explicit and rigorous about what we're doing, but it also means we have to teach the computers, in a sense, to be as capable as the philologists are. So existing computational approaches do not permit nesting units of comparison, even though such nesting is frequent in traditional practice.
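As a sketch of what nested comparison might look like computationally (my own toy illustration, not the behavior of any existing collation tool): compare at the word level first, and only where the words disagree drop down to the letter level.

```python
from difflib import SequenceMatcher

def collate_nested(w1: list[str], w2: list[str]) -> None:
    """Word-level alignment; where words disagree, re-align at letter level."""
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, w1, w2).get_opcodes():
        if tag == "equal":
            continue
        a, b = " ".join(w1[i1:i2]), " ".join(w2[j1:j2])
        # Nested comparison: zoom in from words to letters within the variant.
        detail = [(t, a[x1:x2], b[y1:y2])
                  for t, x1, x2, y1, y2 in SequenceMatcher(None, a, b).get_opcodes()
                  if t != "equal"]
        print(f"variant: {a!r} vs {b!r}; letter-level changes: {detail}")

# Invented example: the inflectional variation a French scribe made freely.
collate_nested("il chanta la chanson".split(), "il chantait la chanson".split())
```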
As the effectiveness of a machine relies on the power of knowledge objectified, to again quote Marx, for each witness we must objectify all levels of linguistic analysis into explicit annotation, and then apply the alignment algorithm to this unified whole. So this is my proposed workflow: we divide the Gothenburg model's five steps of collation into eight steps, five of inspection and three of collation. Now, I don't want to present this dogmatically; if a project actually got up and running, maybe we would need 11 steps, or 15 steps, or whatever. But here they are. Inspection consists of transcription, segmentation, orthographic regularization, part-of-speech identification, and syntactic analysis; collation is the actual alignment, then analysis and visualization. So basically the last three steps are in principle already automated, but the first five are not yet.

I just want to flag the risk of compounding error rates. Fuller automation faces the obstacle that errors compound at every step. As a kind of thought conceit, if each step were 97% accurate, then this eight-step pipeline would yield an overall accuracy of 0.97 to the eighth power, about 78%. Which is just to say that accuracy will go down along the pipeline. But don't put too much stock in that calculation, because it assumes that the errors at each level are independent, and of course they're not; they rely directly on each other, so the concrete results might be a lot worse or a lot better, depending on exactly what kind of error is made. In any case, you can bring down the error rates by increasing the manual training data at each step. But the manual preparation of data required for training statistical models is always the most onerous aspect of NLP: it's extremely expensive to compile manual training data, and it's very boring for the person who has to do it.

We can get over this bottleneck using a technique that is relatively new, at least in this area, called active learning, which I'll describe here in a moment. The computer guides the editor to annotate just those examples that most improve the performance of the models; in this way, the number of cases requiring human intervention is kept as low as possible. You can think of it basically as a trade-off. With manual training data, you might have a lot of students, or even monks somewhere, do the more basic tasks manually; that's expensive because it involves a lot of human labor. Or you can have the computer itself tell you which examples would most improve its own performance, have an expert philologist annotate just those examples, and then retrain the whole system, which is very expensive in terms of computer time. So there's a trade-off between philologist time and computer time. To date, there has been only one paper, as far as I can tell, that has applied active learning in the digital humanities, and only for one NLP task, so there's a lot more that could be done in this area.
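Here is a minimal sketch of the pool-based, uncertainty-sampling flavor of active learning I have in mind. The data and the model are stand-ins (a synthetic pool and scikit-learn's logistic regression), not any of the pipeline's real annotation tasks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: feature vectors for annotation candidates, with a hidden
# "true" rule playing the part of the philologist's judgment (the oracle).
pool = rng.normal(size=(200, 4))
oracle = (pool[:, 0] + pool[:, 1] > 0).astype(int)

# Seed the labelled set with one example of each class.
labelled = [int(np.argmax(oracle)), int(np.argmin(oracle))]

for _ in range(10):
    model = LogisticRegression().fit(pool[labelled], oracle[labelled])
    proba = model.predict_proba(pool)[:, 1]
    # Uncertainty sampling: query the example the model is least sure about
    # (predicted probability closest to 0.5), not a random one.
    uncertainty = np.abs(proba - 0.5)
    uncertainty[labelled] = np.inf                # never re-ask an answered query
    labelled.append(int(np.argmin(uncertainty)))  # the expert annotates it

print(f"accuracy after {len(labelled)} annotations:",
      round(model.score(pool, oracle), 3))
```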
So now, turning to the Kangyur: the comparative Kangyur, the Kangyur dpe bsdur ma, sponsored by the China Tibetology Research Center in Beijing, includes an apparatus; it was begun in 1986 and published between 2006 and 2009. The text itself reproduces the 1733 Derge version, but it records variants from seven other woodblock Kangyurs produced between 1410 and 1934. And I actually want to hold this up as really the pinnacle of Kangyur editing so far. Some people are hard on this project because it hasn't produced a complete critical edition of the Kangyur, but I think it was a wonderful project that has had great results.

Still, the eight collated witnesses reflect only one quarter of the available evidence, excluding all manuscripts from Bhutan, Nepal, and Ladakh, and those held in foreign libraries. In addition, by sticking to the 1733 text, which is to say by doing a diplomatic edition, the editors made no use of the variants that they so carefully assembled. I think that was prudent in many ways, but it does mean that we do not yet have a critical edition of the Tibetan Kangyur. The limits of this twenty-year project, and this is what I want to emphasize, are that it was a great project, very expensive, a lot of work, but it only got us so far, and those limits foreground the need for a radical change in methods.

Just to give you one sample: I referred at the very beginning to the fact that I had edited a short passage from the Vinaya. For that I compared 14 Kangyurs, which is less than half of the available Kangyurs, and here I've circled the ones that are included in the dpe bsdur ma. You'll notice from that structure that half of the tradition is not reflected in the dpe bsdur ma collection, simply because they've omitted the Dolpo Kangyur. It just shows you that even if you've collated a lot of Kangyurs, the one you haven't collated can still hold some really important information.

So I would like to propose a complete critical edition of the Kangyur. By complementing traditional scholarship with cutting-edge technology, I think this project becomes feasible, particularly in China, where the state of NLP is so advanced, including for Tibetan. Let me just reiterate the methodology. First, we use natural language processing tools to model a critic's understanding of a text. Then we use active learning both to maximize the impact of the critics' labor and to minimize the errors that can attend a long NLP pipeline. In terms of how it would be done in a linear fashion: first, transcription, where you transform the images, the JPEGs, into bare e-text, something like a Unicode txt file. Then inspection, where we transform the bare e-text into increasingly linguistically annotated e-text. Then we compare these enriched e-texts to each other. Then we restore the archetype insofar as possible. And then we've done it, as long as you record all the variants.

A look at the Tibetan NLP tools that would be necessary for doing this shows that they by and large exist, which is not to say that some of them wouldn't have to be improved, but they by and large exist. For instance, there is handwritten text recognition, and here I'm just giving citations for papers that discuss the tools involved: handwritten text recognition, segmentation, orthographic regularization, part-of-speech tagging, and parsing. So there are solutions to each of these steps.
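Put as a toy program, the linear workflow might look like the following; every function here is a trivial stand-in for a real Tibetan NLP tool (none of these are real APIs), and steps 4 and 5 are omitted for brevity.

```python
from difflib import SequenceMatcher

def transcribe(image: str) -> str:
    # 1. Handwritten text recognition (stand-in: the "image" already is text).
    return image

def segment(text: str) -> list[str]:
    # 2. Segmentation into words (stand-in: whitespace split).
    return text.split()

def regularize(tokens: list[str]) -> list[str]:
    # 3. Orthographic regularization (stand-in: lowercasing).
    return [t.lower() for t in tokens]

def inspect(image: str) -> list[str]:
    # Steps 4-5 (part-of-speech tagging, syntactic analysis) omitted here.
    return regularize(segment(transcribe(image)))

def collate(images: list[str]) -> None:
    a, b = (inspect(i) for i in images)
    # 6. Alignment; 7. analysis (keep only disagreements); 8. visualization.
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            print(" ".join(a[i1:i2]) or "-", "]", " ".join(b[j1:j2]) or "-")

collate(["Sangs rgyas la phyag 'tshal lo",
         "sangs rgyas rnams la phyag 'tshal lo"])
```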
And that's basically all of my presentation. To some extent it's a sales pitch: I think that people who are interested in canonical Buddhist texts, particularly those that exist on a large scale like the Kangyur, might want to think about some NLP approach to editing the Kangyur. So thank you very much for your time and attention.

So, in a sense, the answer is no, and here's the reason why; let me point you to the specific successful NLP tasks. For instance, the BDRC, the Buddhist Digital Resource Center, is working a lot on OCR now, especially for nicely printed Tibetan texts, but increasingly also for xylographs. And together with an organization called Esukhia, I think, they have made digital copies of three Kangyurs. I think the Lithang is one of them; I don't remember, I would have to check. But in any case, we do now have e-texts of three different Kangyurs, and those have used some quite fancy digital techniques, especially for the xylographs. But of course, what I'm proposing is that we have to do that for the other 25 Kangyurs, right. James Apple is, I think, the person who has done this kind of work most in terms of using actual collation software. But as I was saying, the poor fellow does all the transcription himself, does all of the orthographic regularization himself; he prepares everything painstakingly for the computer, then runs it through the collation software, and then is told the stemma. And it turns out it's the stemma we already knew. But maybe what I would say is that this kind of work he's been doing is extremely important, because now he knows how to do it; this is quite technical work, and just getting those skills is important. But the vision of actually combining NLP tools to do textual criticism has not yet really happened. There are some scholars, especially young scholars doing their PhDs on Latin, who have gotten pretty far, and I can send you some citations for that if you want, but I haven't seen any work that's been done in this area for Tibetan. Although I would say there are the Tocharianists in Vienna; of course, they generally have only one fragment for each text, so they're not needing to collate things, but they use some quite fancy tools, so they're also good people to look at.

The problem that currently exists is that you have to convert your document into a string of symbols, and then have the computer compare that string of symbols. One way to do it, the way we do it in English actually, is exactly what you're saying: you just put in a space, so if you have "singing", you would write "sing", space, "ing", and then compare the roots and the inflections separately. That works for English, but it doesn't work for Tibetan, because Tibetan has a lot of verb morphology; so does English, a little bit, but Tibetan has a crazy verb system. So what you have to do in a language like Tibetan is go through one step where you segment in that kind of linear way, and then come back again, so that when you see a form like bzung, you refer it back to 'dzin; you have to have some other step, which is called lemmatization. And you can do that with software; we actually have software that does that for Tibetan.
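A toy sketch of that contrast (the English suffix list and the Tibetan lemma table are tiny invented samples, not real lexica): English can often be split linearly into stem plus ending, while a Tibetan verb form needs a lookup back to its lemma.

```python
# English: a linear split into stem + inflection is often good enough.
def split_english(word: str) -> tuple[str, str]:
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)], suffix
    return word, ""

print(split_english("singing"))   # ('sing', 'ing')

# Tibetan: the stems of a verb look unrelated, so we need a lemma lookup.
# Tiny invented sample table; a real system would use a full verb lexicon.
LEMMAS = {
    "bzung": "'dzin",   # past stem of "to grasp", referred to the present stem
    "gzung": "'dzin",   # future stem
    "zung":  "'dzin",   # imperative stem
    "'dzin": "'dzin",   # present stem (the lemma itself)
}

def lemmatize(token: str) -> str:
    return LEMMAS.get(token, token)

print(lemmatize("bzung"))         # 'dzin
```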
But let me just compare, say, Tibetan and Chinese. For Tibetan, the easy part is recognizing the letters, because there are only thirty of them. That means that early on in your natural language processing pipeline, you have very good results; it's easy to get good results. But then later on, when you're doing something like referring forms back to their lemmas, it's harder, and so that step might be worse. Whereas if you were doing it with Chinese, the step of actually just seeing what characters are on the page is going to be perfectly possible, but it will require more training data, or more active learning, or will somehow be harder than it is for Tibetan. But then later, when you come to the lemmatization step, you basically get 100% performance for free, because there isn't any complicated morphology. So basically any one language will have its own weak points and strong points, and those will appear at certain moments in the workflow. But in principle it can all be dealt with, given enough time or money. And the point I'm trying to make is that even if it costs, I don't know, ten million dollars or something like that, it's going to be a lot cheaper than doing it the old-fashioned way.
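To put illustrative numbers on that (every accuracy figure below is invented, purely to show the shape of the argument): the cumulative product makes visible how each language hits its bottleneck at a different stage of the pipeline.

```python
# Invented per-step accuracies, purely for illustration of the argument.
profiles = {
    "Tibetan": {"recognition": 0.99, "segmentation": 0.96, "lemmatization": 0.90},
    "Chinese": {"recognition": 0.92, "segmentation": 0.95, "lemmatization": 1.00},
}

for lang, steps in profiles.items():
    cumulative = 1.0
    for step, accuracy in steps.items():
        cumulative *= accuracy  # errors compound multiplicatively (if independent)
        print(f"{lang:8} after {step:13}: {cumulative:.3f}")
```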
I want to appeal to, let's say, everyone's Buddhist nationalism, in a way, with this point about the Greek Bible. How can it be that the Greek Bible has been critically edited 28 times, but the Kangyur, the Buddhavacana, has never once been critically edited? That's a history-of-scholarship thing.

Let me just say where I'm taking this from: it's Paul Maas, which is what we used to learn textual criticism back at Harvard. And there's a very nice recent book by an Italian scholar, actually a Dante specialist, that I relied on a lot; his name is Trovato. And I mean, I think this stuff happened at Alexandria, but there's also a tradition, which Leonard has written about, in Buddhist circles, and I actually have that in the written form of the paper: people have been collating; Situ Chökyi Jungné did this.

I should just make clear that this presentation is actually a sort of manifesto: it's something I think we should all do, not something that I'm doing, because I don't have money for it at the moment. I actually asked the British government for some money to do it, and they said no. So maybe I'll ask again in a year or something like that. But what I would say is that on GitHub there's a place where Tibetan NLP workers are all collaborating a little bit; on GitHub and on Zenodo these things are quite easy to find, but also, if someone wants to email me, I can send you some links.

Okay, so the question, in a sense, is: if everything works well, what would happen? Here's what I would say. You have a little machine, and you put into it photographs of the manuscript pages of all Kangyurs, and I think there are about 40 now. And out the other end comes something that looks a lot like a Kangyur page; I would put the notes at the bottom of the page, so there would be an edition with the variants at the bottom. Now, suppose that's done entirely automatically, which I think it could be.

What problems will arise? Well, all kinds of problems will arise. Maybe there was one spot in one manuscript that was particularly blurry, and the computer has mistranscribed it; that will happen. Or maybe in some cases it has chosen a reading that is consistent with the structure of the stemma, but that happens not to be, in some absolute sense, the right reading. Or maybe there needs to be what's called conjectural emendation, which, for instance, the computer cannot do; in many cases you would need a good conjectural emendation to actually solve a difficult philological problem. Which is to say: am I trying to make everyone unemployed by replacing them with machines? Well, in a sense, yes; but in a sense, no. Think of it just like this: if you could go to the bookshelf and there was a beautiful critical edition of the Kangyur, you would still have to exercise your intelligence in using it, and you would still have to go back to the original manuscripts for some purposes. But now you see people, and let me just give you an example, like Shayne Clarke, Buddhologists who work in Sanskrit, Chinese, and Tibetan, who just choose, say, the sTog Palace Kangyur, because it's easy to read and because it's relatively good philologically. I think that's a very practical choice, but think how much better it would be if, rather than just using the sTog Palace Kangyur, because you can't read every single passage in all 40 Kangyurs (you would stop being a Buddhologist if you did, right), you could instead turn to this automatically made critical edition. I think that would be good. Thank you so much.