I'd like to thank Nathan for inviting me to speak about my take on this subject. I was wondering how I would fill 20 minutes talking about data management. And as I put this together, I realized that I could probably talk for about two hours about it. Unfortunately, what always bothers me about my specialty is that every time I give a presentation, I have to spend several minutes explaining what Pyu is. And yes, you'll have to sit through that, because there's no real point in me talking about data if you don't even know what the data is of. So please bear with me.
Pyu is a modern Burmese exonym for people who once lived in walled cities in what is now called Upper Burma. We don't know what the Pyu called themselves. They flourished during the first millennium CE, and they assimilated with Burmese speakers who arrived around the 9th century CE. The last Pyu text is from the late 13th century CE, and their language has been largely a mystery for a very long time. At this point, we know very little about the Pyu language. We know that it is a Trans-Himalayan, aka Sino-Tibetan, language, which is somehow related to the language that ultimately replaced it, Burmese. But apart from that, we really don't know much.
What remains of the Pyu language is three kinds of data. First are Pyu-language epigraphic texts in an Indic script, as in this background image here. The second type of data is Tang Dynasty Chinese transcriptions of the Pyu language in Chinese characters. Here is the Chinese transcription of the ethnonym Pyu, in red. These are extremely limited in number and difficult to interpret. The third kind of remains of the Pyu language is some borrowings in Burmese. If you attended last month's workshop, one of the speakers talked about potential Pyu loanwords in Burmese. This is all that remains of Pyu. Pyu has left no descendants. We just have two kinds of textual evidence, in the Pyu script and in Chinese script.
The third type is some words in Burmese that are thought to be of potential Pyu origin. Most of my work involves the Pyu-language inscriptions, which are the largest body of evidence concerning Pyu.
These inscriptions are documented in several formats, one being photographs. These photographs vary considerably in quality. We have photographs ranging from the colonial era, when Burma was a British colony, all the way up to the present. Many of these photographs have been taken by the photographer James Miles of Archaeovision. One might think that new photographs, in color, in high resolution, and so forth, would be inherently superior, but often this is not true. Older photographs are often better because the items were less damaged in the past. The photo here is of the most famous Pyu inscription of all, the so-called Pyu Rosetta Stone, which has four sides in four different languages, one of them being Pyu.
Photographs also include rubbings of inscriptions. The previous photograph I showed you was of an object, but we also have photos of rubbings, including unpublished rubbings in the collection of Professor Janice Stargardt of Cambridge.
We also have so-called reflectance transformation imaging files, or RTI files, that were made by Oliver Riffens and shown here by James Miles of Archaeovision. These RTI files are made from multiple photographs of an inscription, which are put together into a computer file that, when viewed with appropriate software, can simulate the experience of viewing the same inscription under different kinds of light. In this example here we see these two vertical lines, which are Pyu punctuation marks, and you can see different aspects of them in different lighting. This is a single RTI file, but by manipulating the software I can change the lighting and see things in different ways.
James Miles of Archaeovision has also made for us 3D photogrammetry models of smaller Pyu objects.
We are able to rotate these things, enlarge them, and so on. So these are the kinds of files that are generated from the objects: photographs, rubbings, photographs of rubbings, RTI files, and 3D models.
Now, the problem is: how do I catalog and keep track of all these things? When I first started working on Pyu two and a half years ago with the mission of trying to decipher the language, I only knew of a few inscriptions, so I didn't feel the need to number them. I came to Pyu studies from Kitan studies. Kitan is an extinct Central Asian language, also primarily preserved in the form of inscriptions. In Kitan studies, for whatever reason, no one has ever come up with any kind of cataloging system. So there are no universal numbers, there are no universal conventions, it's a complete mess. And I, unfortunately, inherited this chaotic mindset, which turned out to be a disaster, because it turns out there were a lot more inscriptions than I had thought. Here is the so-called inscription house in Burma, where there are multiple Pyu inscriptions.
As a starting point for resolving this mess of cataloging: Charles Duroiselle in the early 20th century came up with an initial inventory of Pyu inscriptions. His first inventory had just five items. He eventually enlarged this to 15. More and more inscriptions were found over the following 80 years, but the problem with Pyu studies is that after the colonial period, Burma went through some rough political times and Pyu studies were basically frozen in time. Fortunately, a few years ago, Arlo Griffiths and Julian K. Wheatley picked up where Duroiselle left off, adding 155 inscriptions to his list of 15 for a total of 170. Arlo and Julian have collected these inscriptions in an Excel file containing these various fields. Now, I could go on and on about all these fields, but the issue of interest to me is numbering the numberless items. How is one supposed to number these things?
Almost none of these inscriptions have dates, so for the most part we can't arrange them in any kind of chronological order. The few dates we have are termini a quo: they tell us a point that we know the inscription must postdate, but they don't tell us exactly when it was made. And a few dates are in a calendar we don't even understand. So we can't use chronology to help us organize these inscriptions. Geography is not all that helpful either, as most of the inscriptions come from three sites. So, ultimately, we ended up keeping Duroiselle's original 15 in their arbitrary order. The remaining 155 inscriptions were grouped into thematic groupings and numbered consecutively within each grouping. So 16 through whatever is the Pyu inscriptions on stone, Pyu inscriptions in the Pali language were another set, and so on. Unfortunately, new finds that would fit into these earlier categories are just dumped at the end. This is unfortunate, but unavoidable. And new finds are being made all the time. Here's an example that I came across while I was visiting Burma exactly a year ago. We went to an archaeology office and, boom, we were told, hey, we just found this two days ago.
Now, what I do with all of those photographs and so on is try to transliterate the text. We have taken the raw materials and already converted them into photographs and so on. This is the next stage, where I take the texts and try to convert them into letters. Generally speaking, Pyu is written in an Indic script. There are conventions for converting Indic scripts into Roman script, and so, generally, we follow them. But one huge problem is that we don't know exactly how Pyu was pronounced. This is a problem because it means we don't really know how meaningful this transliteration is. I'm just mindlessly copying letters without knowing exactly what they stood for.
I then take these letters and put them into XML files, one per inscription. These XML files are in the EpiDoc standard and they contain all sorts of information, such as other people's readings in the past. As a phonologist, I then take these readings and chop them up into little pieces, like consonants and vowels, and these are put into columns in Excel files.
Now, earlier today, we encountered the theme of version control. Why is that important? Well, here's a personal example. As I said before, we don't really know much about how Pyu was pronounced. We're just mindlessly copying letters and converting them into Roman script. There is a subscript dot in Pyu that is very, very common, and for a century people have had different arguments about what this dot stood for. When we first started transcribing Pyu, we just arbitrarily decided, okay, we'll make this dot an apostrophe and we'll stick it after the last consonant before a vowel. Later on, we changed our minds and decided to mark it as an m with a dot, since m with a dot is a convention in Indology for writing superscript dots. We thought, oh, it'd be really clever to use m with a subscript dot to write a subscript dot.
Now, here's where version control came in. This isn't just for 22nd century archaeologists. I've mentioned my Excel files and I've mentioned XML; we also had RTF and text files. Migrating the data through all these different formats meant that when we changed our minds about how to transliterate Pyu, we of course had to change everything in the other formats as well. And this is where version control became crucial, because search and replace with Pyu is a nightmare. I tried to use regular expressions to do the search and replace, but Pyu structure is so complicated that a simple formula doesn't quite work. Also, there were typos and such, where the apostrophe was in the wrong place to begin with.
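As a rough illustration of the search-and-replace problem described here, the following is a minimal Python sketch, not the project's actual script. Everything specific in it is an assumption made for illustration: the simplified consonant and vowel inventories, the sample syllables, the function names `migrate_syllable` and `migrate_text`, and in particular the choice to write the new `ṃ` after the vowel. It converts the old convention (an apostrophe after the last consonant before a vowel) and, instead of silently rewriting syllables where the apostrophe is misplaced, sets them aside for manual review.

```python
import re

# Hypothetical, simplified inventories; the real transliteration is richer.
CONSONANTS = "kgcjtdnpbmyrlvsh"
VOWELS = "aiueo"

# Old convention: onset consonants, an apostrophe, then the vowel and the rest.
OLD_SYLLABLE = re.compile(rf"^([{CONSONANTS}]+)'([{VOWELS}])(.*)$")


def migrate_syllable(syllable):
    """Return (new_form, ok); ok is False when the syllable needs manual review."""
    if "'" not in syllable:
        return syllable, True  # no subscript dot marked, nothing to change
    match = OLD_SYLLABLE.match(syllable)
    if match is None or "'" in match.group(3):
        return syllable, False  # misplaced or repeated apostrophe: do not guess
    onset, vowel, rest = match.groups()
    # Assumed target convention: the dot is written as "ṃ" (U+1E43) after the vowel.
    return f"{onset}{vowel}\u1e43{rest}", True


def migrate_text(text):
    """Migrate a whitespace-separated text and collect syllables needing review."""
    converted, review = [], []
    for syllable in text.split():
        new_form, ok = migrate_syllable(syllable)
        converted.append(new_form)
        if not ok:
            review.append(syllable)
    return " ".join(converted), review
```

Calling `migrate_text` on a whole file's text returns both the converted text and a list of syllables that did not fit the expected pattern, so a typo such as a trailing apostrophe is left untouched and reported rather than converted wrongly. Committing the files before and after each such run is what makes it possible to go back when the pattern turns out to be wrong.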
And so I made innumerable errors and had to convert things over and over and over again. Sometimes I would think, oh, I'm done. I also use GitHub, by the way. So I'd upload things to GitHub and then I'd discover, oh, God, it's all wrong. This is where version control comes in handy, because you go back to your old version. You try to diagnose where your search and replace went wrong. You undo that. You do the search and replace again. You re-upload to GitHub. So the value of GitHub for me personally is this type of error management: discovering where I went wrong in the past, restoring the old data, and so on.
I've mentioned GitHub. From the XML there, we then move to the publicly accessible site, the Corpus of Pyu Inscriptions. My attitude toward Pyu decipherment is an open-source kind of attitude. I like the idea of the public being able to look at the same data I do and come to their own conclusions about it. This website contains the latest version of the XML for the Pyu inscriptions. It's synced with our GitHub. It contains images of the inscriptions and a bibliography of Pyu studies. We've already mentioned Zenodo previously. The XML, the photographs, and the RTI files are publicly archived at Zenodo.
The future steps with data include some that I forgot to mention here, so I will. I've been talking purely about Pyu-script material. If we just start with the Pyu script, I'm working on an archive of images of Pyu akṣaras. Akṣaras are character combinations representing syllables. All this colored material here is a single akṣara for the word for PIN that I translated as Kedah. I have taken screenshots of these akṣaras and I am building an archive of them so I can try to break down the Pyu script into its components and look at how different letters take different shapes in different environments and so on.
All of this will eventually be publicly archived as well, but it is still in the compilation process. Using this analysis of the Pyu script, I hope to come up with a proposal for encoding Pyu in Unicode and, at some very late stage, ultimately convert the Pyu texts back from this transliteration into the original Pyu script in Unicode. Lastly, other data that I plan to work with in the future are, for instance, the Chinese transcriptions of Pyu, which have never been systematically analyzed. That too will all be publicly archived. So I am now going to end 12 minutes early, and we'll see if we can have a record number of questions. Yes?
Are all the inscriptions just about kings, battles, and donations, or is there some more interesting material in the inscriptions?
Oh, well, the state of Pyu studies, as I said... Considering that we hardly know how the language was even pronounced, right now we only know about 200 words. And so, quite frankly, most of the inscriptions are incomprehensible at this point. When I transcribe these things, I feel like some kind of mindless robot much of the time, because I have no idea what they are saying. One saving grace of Pyu is that it is in an Indic script, so we recognize the letters. Most people here would be familiar with Indo-European languages, so for them the effect is that of looking at Hungarian or Finnish, where you recognize the letters but the text is just completely alien-looking. The language is very distantly related to Burmese, and I have studied Burmese, and frankly it really doesn't help a whole lot. So the point is that we really don't know what most of these things are saying at all. Some are funerary texts, because we recognize things like "die" and we recognize dates, so we assume those are death dates, but we don't really know. And we assume that the king and the name on it is the name of the deceased, but this is all assumption really.
I mean, no one has really come up even with something as simple as a really good word-by-word analysis of these alleged funerary texts. We just recognize these words and people jump to conclusions. A lot of Pyu studies is highly conjectural, and this is not really emphasized, because people like pretending they know what they don't.
So anyway, to bring this back to a more database-oriented theme: the fact that we understand so little makes cataloging this material really, really difficult. I gave just enumerating the inscriptions as an example, but there is also just trying to figure out how we're going to convert this into letters. You can base it on Indic conventions to some extent, but there are things like the subscript dot and other oddities that have no Indic basis. I've argued with my colleagues back and forth: what are we going to do about this? And then it's like, okay, well, let's search and replace and try it this way. No, and then let's use version control and undo that. And back and forth and back and forth. That's because so much of this is just so unsettled. I think what you may find of some interest in my talk is that I'm dealing with the problem of cataloging almost total terra incognita. That's a very different issue from what other people are working with, I think. Just trying to figure out what categories to put things in is a mess. You asked about the content of the inscriptions; I really can't use that as a method of cataloging them at all. I don't know what these things are saying, and that's why the categories for the enumeration are so crude. Is it on stone or is it on metal? Yes.
So you mentioned that the XML files are all on Zenodo as a zip file.
Yes.
And the primary sources, in terms of photographs and RTI files. RTI files are big, so I presume that you can't fit all of that in one Zenodo record. So how are you keeping track of the relationship between the different Zenodo submissions?
Each Zenodo submission is cataloged by the Pyu inscription number, and so I look things up in Zenodo by those numbers. Under each number there's a huge file with the RTI and the photos. Frankly, what I don't like about Zenodo is that I can't just grab a single file out of these huge collections.
Oh, that has to do with how they're uploaded, I think.
Yes.
Because they were uploaded as zip files, you have to download them as zip files.
Right, right, and take them apart. But if I had uploaded them as individual files, then it would have been easier to use.
Yeah, but on the other hand, then we would have hundreds of files per inscription, and that would be a different sort of problem.
Yeah, because it takes many, many photographs to create a single RTI file that can be viewed with simulations of different kinds of lighting. Yes.
You've made your data available in the EpiDoc XML format. I was wondering if you could comment on how much work that was and what the benefits of that are.
The benefits of EpiDoc for me are that it builds a bridge between Pyu studies and other kinds of epigraphy, I think. One weird thing about Pyu studies is that it has been basically an island, totally cut off from everything else. Even within Sino-Tibetan linguistics it is pretty much cut off from everything else. When I studied the EpiDoc standard, Daniel and I went to, of all places, Romania to take classes in EpiDoc, and we studied along with classicists using EpiDoc for Latin and Greek. The strength of EpiDoc, I think, is that it's a shared format that is usable for any kind of epigraphy, and Greek and Latin epigraphy experts have encountered many problems already and have found solutions for them. Pyu is still an infant field, barely explored, and so it is nice to have this body of expertise available that can be recycled for our own purposes without our having to reinvent the wheel, and I find that valuable.
I think Pyu is just so unstable and so mysterious that any kind of help I can get is very much appreciated, and using an existing standard like EpiDoc that has been around for a long time helps toward that end.
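To make the point about a shared format concrete, here is a minimal Python sketch of how a generic tool might read numbered lines out of an EpiDoc-style file. It is an illustration under stated assumptions, not the project's code: the fragment uses the TEI namespace and the `div type="edition"`, `ab`, and `lb` elements that EpiDoc builds on, but the text is placeholder content rather than a real Pyu reading, and the function name `edition_lines` is made up for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: element names follow TEI/EpiDoc conventions,
# but the text is placeholder content, not a real Pyu reading.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition">
      <ab><lb n="1"/>placeholder line one
          <lb n="2"/>placeholder line two</ab>
    </div>
  </body></text>
</TEI>"""


def edition_lines(xml_text):
    """Map each line number (the n attribute of <lb/>) to the text that follows it."""
    root = ET.fromstring(xml_text)
    lines = {}
    for ab in root.findall(".//tei:div[@type='edition']/tei:ab", TEI_NS):
        for lb in ab.findall("tei:lb", TEI_NS):
            # In ElementTree, text following an empty element is stored as its tail.
            lines[lb.get("n")] = " ".join((lb.tail or "").split())
    return lines
```

Because the same element names are used for Greek, Latin, or Pyu editions, a small reader like this does not need to know anything about the language of the inscription, only about the shared markup.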