Hello everyone, thank you for joining this talk. We're going to be covering text processing in general and why it's actually quite complicated. First things first though, let's talk about who I am and why I can speak about this with any kind of credibility. I've been programming for about 15 years now. I started mostly with embedded, although my history with that is pretty short, but that's where I got started; that was mostly Ada, and to a lesser extent Forth, Spin, and assembly. Forth many people pretty much know; Spin is a processor-specific language for the P8X32A, which you probably haven't worked much with, it's sort of a novelty thing. But about 12 years of that overall experience was with text. That would be in C++, C#, and F# for the most part, although there have been exceptions: some things strictly in C, some parts in assembly as well. Most of that experience comes from work with compiler toolchains, and that is definitely where I got my start, although I have branched out from there into some natural language processing and some other things we're going to touch on a bit. It's also worth mentioning that I have done quite a bit with language in a non-computing sense: understanding things like what makes a language synthetic or polysynthetic versus agglutinative, or what makes a writing system an alphabet versus an abugida versus an abjad. That being said, I'm not really a programmer by trade. It's something I've been doing for quite a while, and I have made numerous contributions to various open source projects, but for the most part my work experience is non-computing. As it happens, I've got about five years of medical work. I'm stuck in a hotel right now, basically living there for the length of this pandemic, because New York State especially needs the help, so I've returned to that for the time being.

The common thing I hear when talking about this kind of stuff is "but I know this already." And you think you do. Hopefully you actually do, but that's just typically how this goes. The example I love to give, because it shows multiple different misunderstandings in one nice, simple example, is the palindrome. Almost everyone gets this wrong, and it sounds really simple. We're going to talk a bit about what those misunderstandings are, along with a bunch of other things. The general approach I see is that most programmers think you can just reverse the string and compare it for equality case-insensitively. Typically you'll see something like ToLower; ideally you want to be doing ToUpper. That's a very minor thing, but there are a few edge cases where it matters. If you're doing a case-insensitive comparison, you generally aren't going to run into issues if you are an English speaker, and we'll get into some of the issues that others face with that. The code I usually see, written in a mostly C#-ish way, is this. You don't have a Reverse method that works directly on strings, so you'd normally call Array.Reverse or the LINQ extension, but otherwise this looks like normal C# code. The thing is, this approach is actually wrong in a large number of cases. It fails with several, actually a lot, of inputs. Any text above the Basic Multilingual Plane, for a start.
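The slide code isn't reproduced in this transcript, so here is a minimal sketch of the naive approach being described, written the way it usually shows up; the shape of the code is my assumption.

```csharp
// Naive palindrome check: reverse the chars and compare case-insensitively.
// This is the approach being critiqued; it breaks on combining marks,
// surrogate pairs, punctuation, and whitespace.
using System;

static class Naive
{
    public static bool IsPalindrome(string text)
    {
        char[] chars = text.ToCharArray();
        Array.Reverse(chars);                 // reverses UTF-16 code units, not graphemes
        string reversed = new string(chars);
        return string.Equals(text, reversed, StringComparison.OrdinalIgnoreCase);
    }
}
```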
I'll touch on it a little more later, but there's a common misconception that basically every language people normally speak fits within the Basic Multilingual Plane. That's not the case. The really big examples are the CJK group, Chinese, Japanese, and Korean: they actually have code points being assigned all the way into the tertiary ideographic plane. So yeah, that matters. Any text with combining marks also fails. This one again meets a sort of combative misunderstanding of "well, don't use combining marks, use the precomposed forms." You can't in certain languages; we'll get to that. Any text with symbols fails too. Punctuation is a normal part of a sentence, but that algorithm didn't account for punctuation. It didn't account for spaces either. There are other issues, but by now you should be seeing that the algorithm was a bit naive. And just how wrong does this wind up being? Well, for an empty string it works just fine. For these examples, and some of you might hate me for that first one, it will still work: these are individual words with no odd things going on, so reversing them and comparing will just work. But now we're getting into some sentences, some well-known palindromic sentences, and you should start to see the problem here. There's another particular example I like in that it's very simple and shows a lot all at once: "Café, éfac." I don't know if that's a real place; maybe somebody is being creative and has actually named their place that, that'd be neat. But it shows a lot of problems in one simple example, because you can get confusingly wrong errors like this one, because of combining marks. Usually when I show this, people jump to "oh, you need to use a certain normalization form." To an extent that's true: for Latin-based languages such as French, Normalization Form C will allow correct comparisons using this kind of approach. Normalization Form D will break for exactly the reason this broke. So normalization forms are not necessarily going to save you here, and there are other problems with how certain languages have to use combining marks. We will get to that; I have an example from, well, we'll get to that when we get to that.

So hopefully I have your attention now: there's some complexity here that we have been ignoring and dismissing away, and that we need to cover a bit. Some of the fundamentals we need to cover. ASCII: it's a very, very old character encoding that wasn't really meant to handle all the different languages out there. It's a good idea to stop assuming text is ASCII, and it's a good idea to stop assuming that Unicode works like ASCII. Those assumptions are responsible for most of the problems I wind up seeing. Unicode, on the other hand, is basically what replaces it. In some very rare edge cases you still see other encoding systems used, but for the most part you're not going to see that in a typical computing situation; they are mostly still used in embedded systems, because space is a concern there. Unicode encodes basically every value. There's still some stuff being added; like I mentioned, some of the CJK stuff is still being added into the tertiary ideographic plane, so we're not entirely done encoding everything yet. But for the most part, unless you're doing some funky linguistic stuff, everything's there. Like I said, it handles basically every language, with just a few archaeological edge cases or a few artificial languages that haven't been encoded. But it's basically all there.
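Here is a small sketch of the café problem in C#. The strings are my own rendering of the example: one "café" with a precomposed é, one with a combining acute accent.

```csharp
// "café" with a precomposed é versus "cafe" + U+0301 COMBINING ACUTE ACCENT.
// Ordinal comparison says they differ; NFC normalization makes them compare
// equal, but naive char reversal still detaches the combining mark from its base.
using System;
using System.Text;

class CafeDemo
{
    static void Main()
    {
        string precomposed = "caf\u00E9";   // é as one code point
        string combining   = "cafe\u0301";  // e + combining acute

        Console.WriteLine(precomposed == combining);                        // False
        Console.WriteLine(precomposed.Normalize(NormalizationForm.FormC)
                          == combining.Normalize(NormalizationForm.FormC)); // True

        char[] chars = combining.ToCharArray();
        Array.Reverse(chars);
        Console.WriteLine(new string(chars)); // the combining accent is now detached from its 'e'
    }
}
```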
There are a few things I consider design problems with Unicode. Its creators were not Eurocentric; they were aware of some of the different language systems out there, but I would argue not aware enough at the time they were designing it, and so we see some problems because of that. That being said, don't throw the baby out with the bathwater. Unicode is overall pretty good, and there are ways of dealing fairly easily with some of those design issues. It's just not perfect. But it's good. Unicode encodings we need to talk about a little, because this is going to come up; this is again responsible for some of the misunderstandings we've been having. UTF, the Unicode Transformation Format, is the typical encoding family you use with Unicode. There are three flavors, and two of those have two variants each, for five total UTF encodings that you will typically encounter: 8-, 16-, and 32-bit flavors. The 32-bit one is just the Unicode scalar values; we'll cover exactly what that is really soon. There are also two others you can see from time to time, though not that often: SCSU and BOCU-1 are also Unicode encodings. SCSU is, I believe, still officially part of the Unicode standard; BOCU-1, I believe, was at one point but is definitely not considered part of it anymore, which is interesting because BOCU-1 actually solves a lot of problems that SCSU had. I'm not going to get into why that happened. Typically you don't see SCSU or BOCU-1 unless you also have to compress the data and send it over a data stream or whatever; that's pretty uncommon, and the compactness that UTF-8 or UTF-16 provides is often enough, so you're typically not going to take on the additional complexity of BOCU-1 or SCSU.

So, covering some Unicode terminology. There is the code point: this is any of the entries in the Unicode table, just any of them. Historically, Unicode could go above hexadecimal 10FFFF; that has been restricted, because the way UTF-16 works gives it a lower range than UTF-8 or UTF-32, and the standard has been officially capped at that upper value. Then we have the scalar value. This is pretty much every code point; there is a small slice in the middle, which we'll get to, that is not considered a scalar value. Essentially, a scalar value is any Unicode code point with an actual significant meaning of its own. We'll get into what that little slice in the middle is and how it's different. Then you have the code unit. For UTF-32, the code unit is just a code point, that's it. Technically you're not supposed to use the other parts, which are the surrogates; you could have surrogates in there, it's just considered invalid. The code unit is the smallest part of the encoding. It's more of an encoding thing than a Unicode thing, and it's one of those things people mix up: the difference between a code point or a scalar value and a code unit. With UTF-8 the code unit size is 8 bits, which should make sense now; the idea is that that is how much you read at a time to determine any part of the encoding. Similarly, with UTF-16 the code unit size is 16 bits. Then you have surrogate pairs. These are that little slice in between the scalar values: specific reserved code points for representing, in UTF-16, anything that has to go above the 16-bit range.
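To make the code point / code unit / surrogate distinction concrete, here is a small sketch. The sample character (U+20000, a CJK Extension B ideograph) is my own choice, not from the talk.

```csharp
// One character from outside the Basic Multilingual Plane, U+20000, seen as
// UTF-16 code units versus a single scalar value.
using System;
using System.Text;

class CodeUnitsDemo
{
    static void Main()
    {
        string s = "\U00020000";                          // one scalar value, U+20000

        Console.WriteLine(s.Length);                       // 2 -- two UTF-16 code units (a surrogate pair)
        Console.WriteLine(char.IsHighSurrogate(s[0]));     // True
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 20000

        Console.WriteLine(Encoding.UTF8.GetByteCount(s));  // 4 -- four 8-bit code units
        Console.WriteLine(Encoding.UTF32.GetByteCount(s)); // 4 -- one 32-bit code unit
    }
}
```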
I'm not really going to get into why the surrogates were encoded specifically as Unicode code points, but it's a good thing that they were. Then you have normalization forms. We mentioned these a little, and you may be familiar with the concept: generally, when you run into issues, people jump immediately to "well, normalize the strings before you do stuff." They get you part of the way to what you want, but there are issues with that that we will be talking about. There are four normalization forms: Canonical Composition (NFC), Canonical Decomposition (NFD), and then Compatibility Composition (NFKC) and Compatibility Decomposition (NFKD). I'm not going to talk about specifically what those are and how they vary; I just want you to be aware of them. We're going to touch on that a little more later because, like I said, normalization doesn't completely solve the problems and you should be aware of why. Then you have the BOM, the byte order mark. This is used with encodings to tell you the endianness of the encoding; that's what I meant about the variants. UTF-8 doesn't have variants, but UTF-16 and UTF-32 need a byte order mark to specify the actual endianness of the stream. Then you have grapheme clusters. This is what most people think they are working with, and it is not what you're usually working with. That assumption is responsible for a huge number of the problems we wind up seeing.

We also need to cover a little linguistics terminology just so we're all on the same page. An orthography is the overall set of rules for writing a language: it includes the writing system, the grammar, and spelling, if your language has a standard spelling. Fun fact: English does not, so arguing over spelling is kind of weird, because there's no official spelling standard and even dictionaries vary. Certain languages do have one, though. Then the alphabet. Like I said, this is part of the orthography along with the writing system, and it's not just alphabets: there are abjads, abugidas, logographies, ideographies, and syllabaries. There are all sorts of different approaches, so it's not just alphabets. You have the phoneme, the single smallest part of the sounds of a language, and the grapheme, the single smallest graphical part of the writing. Typically, with alphabets, there is a one-to-one correspondence between graphemes and phonemes; other writing systems do that differently. None of this directly solves our problem, but I will say ahead of time that I'm not going to be entirely negative. I'm not just pointing out problems; I will show you what is actually being done to help with these issues.

So what are we actually doing? Because I know what we think we're doing, and we're not doing that. One of the things people should really be more aware of is that in .NET, the string is UTF-16 encoded. This has some profound implications. It means that a char is a UTF-16 code unit. We often think it's a grapheme. It's not a grapheme; it's not even a Unicode scalar value. It's a UTF-16 code unit. This is important for certain languages which have to use parts of the astral planes, which is anything not in the Basic Multilingual Plane. That does actually include some real spoken languages, so it's not just archaic stuff up there; there are actual spoken languages that have to use the supplementary and tertiary planes. The rest of the planes haven't been allocated yet. Then there's Rune, which we will talk a little bit about towards the end; it's a somewhat newer thing.
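A quick sketch of "char is a code unit, not a grapheme": the counts disagree as soon as a combining mark shows up. The sample string is my own.

```csharp
// String.Length counts UTF-16 code units, not "characters" in the human sense.
using System;
using System.Globalization;

class LengthDemo
{
    static void Main()
    {
        string s = "e\u0301";   // "é" written as e + combining acute accent

        Console.WriteLine(s.Length);                               // 2 -- code units
        int scalars = 0;
        foreach (var rune in s.EnumerateRunes()) scalars++;
        Console.WriteLine(scalars);                                // 2 -- scalar values
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 -- grapheme cluster
    }
}
```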
This is a Unicode scalar value. It's still not a grapheme, it's still not what we think we're working with, but it is considerably better. So if you, for some reason, can't use any of the third-party APIs that we'll talk about towards the end, this will get you a good amount of the way towards what you actually want to be doing. Continuing on: indexing into a string returns the UTF-16 code unit at that location. This may be a scalar value, if the value is below the point at which UTF-16 needs surrogate pairs to encode it. If there are surrogate values in your string, and you have to assume there could be, indexing could get you a surrogate half, which is pretty much meaningless without the other surrogate half. So indexing into a string is actually not that good an idea, and you should probably stop, unless you are certain there's no other-language content. Parsing a programming language, for instance, may be fine for indexing into a string, but generally it's an operation you should actually stop doing. Similarly, LINQ's Reverse reverses the collection, and in the case of strings, because a string is an array of UTF-16 code units, what you actually wind up doing is reversing the code units. This has problems: the UTF-16 surrogates get reversed in a way you do not want. They need to be in a specific order, and when they are reversed you wind up with broken parts of your data. That's bad. And Array.Reverse does the same thing, so it's not like one of these does the right thing and the other the wrong thing; they're both wrong.

So there's a number of misconceptions that really need to die. We've touched a little on them. "It works on my machine." I really thought we had killed this off, and in many fields of programming it does seem like we have; you barely hear that attitude anymore, though I did have the misfortune of seeing it about three weeks ago on a Microsoft GitHub page, from a Microsoft employee. This one is particularly dangerous with text. It typically doesn't work, and really only works with simple test cases. The big exception is English. English is always fine, because of the orthography of English; there is no funky stuff going on with English. The problem here is that most of the world doesn't speak English and that there is a disproportionately large number of programmers who do, so it logically follows that the test cases are going to be mostly English. The samples rarely wind up including combining marks. Like I had said, most human input winds up using precomposed characters, although it's very wrong to assume that you're going to be getting precomposed characters. The only reason for that is that a large number of languages already have precomposed forms of their orthography in the Unicode standard. But as I had mentioned, there are languages, one of them a pretty major language, that do not have precomposed characters and have to use combining marks. So you have to support combining marks even if you normalize to Normalization Form C. Part of the issue as well is that a lot of the samples used to test algorithm implementations rarely include anything from the supplementary or tertiary planes, collectively called the astral planes. So, what is it that we're actually doing, now that we know what we think we're doing? We're not working with graphemes.
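Here is a sketch of exactly how the reversal breaks. The sample string (an emoji after two ASCII letters) is my own, chosen only to force a surrogate pair into the data.

```csharp
// Reversing a string char-by-char reverses UTF-16 code units, which flips the
// order inside a surrogate pair and leaves invalid text behind.
using System;
using System.Linq;

class ReverseDemo
{
    static void Main()
    {
        string s = "ab\U0001F600";   // 'a', 'b', then U+1F600, which needs a surrogate pair

        string viaArray = ReverseViaArray(s);
        string viaLinq  = new string(s.Reverse().ToArray());

        // Both put the low surrogate before the high surrogate: invalid UTF-16.
        Console.WriteLine(char.IsLowSurrogate(viaArray[0]));   // True -- broken
        Console.WriteLine(viaArray == viaLinq);                 // True -- both equally wrong
    }

    static string ReverseViaArray(string text)
    {
        char[] chars = text.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }
}
```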
I think I've already made it pretty clear that char is actually a unit of encoding and that Rune is a table entry. It's better, but it's not graphemes. Array reversal reverses the code units, not the graphemes. You never want that, by the way. I have literally never encountered a situation, or read a paper, that makes use of reversing the encoding of text. I can't even think of a situation in which that is acceptable; it should always be considered an error, because it just breaks your stuff. More misconceptions: many programmers also misunderstand text itself, and I think it comes from getting into text work without brushing up on linguistics at all. I'm going to be nice here and not name the company, but there is a pretty large company that does a lot of stuff with text whose main people doing text work are a security researcher and a multithreaded-performance-optimization guy. They're really good at that kind of stuff, but they have no linguistics experience and no text-specialized programming background, just unrelated stuff. They have some broken products as a result. Tying back to what we were talking about before, palindromes aren't just single words, and a quick search in a dictionary will show you that; even if you've never taken a collegiate-level literature course, you can still figure out that there are palindromes beyond single words. And, while we're not going to talk about what these are, there is a difference between orthographic ligatures and literary ligatures, and it gets particularly complicated in that the exact same grapheme can be an orthographic ligature in one language and a literary ligature in another. So dealing with that appropriately can actually get kind of complicated. Normalization forms won't save you, as I have been mentioning. You'll see why, I promise, but they're not actually going to save you. They are a step in the right direction, just not a panacea. Then combining marks: as I have mentioned several times at this point, while there are languages that have precomposed variants, and that helps simplify things, there are languages, including a very major one, that use combining marks and have to because of how the language works.

So, the real-world implications of getting this stuff wrong. Almost every text algorithm, at least that I've seen, and I spent over 300 hours in preparation for this looking through stuff, trying to find implementations that were aware of these issues, and I didn't find anything. Almost all text processing, almost every text editor, almost every program that works with text, and that's a lot of programs, is broken for some part of the world's people. Which part varies depending on what the specific issue is; it's never broken for English speakers, but it is pretty reliably broken for some part of the world's people somewhere. They get pretty broken. GYP, and I don't know or care how that one gets pronounced, I refuse to use it because it's quite bad, partially because of this problem and partially because of other stuff, has an issue that was filed eight years ago and still isn't fixed: GYP doesn't understand paths with non-Latin characters. So if you are an Indian whose user name on the computer is in Hindi, and you go to try to contribute to a project that uses GYP, it's going to crash, and you're going to have an interesting time contributing.
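As a sketch of what working at the grapheme-cluster level looks like instead, .NET's StringInfo can enumerate text elements. This is my own illustration, not the approach of any particular library mentioned in this talk, and it still ignores punctuation, whitespace, and locale-aware case folding, which a real palindrome check would also need.

```csharp
// Sketch: reverse by grapheme cluster (text element) rather than by code unit.
using System;
using System.Collections.Generic;
using System.Globalization;

static class GraphemeReverse
{
    public static string Reverse(string text)
    {
        var clusters = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
            clusters.Add((string)e.Current);   // each element is one grapheme cluster
        clusters.Reverse();
        return string.Concat(clusters);
    }

    static void Main()
    {
        Console.WriteLine(Reverse("cafe\u0301"));   // the accent stays attached to its 'e'
    }
}
```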
The typical workaround given, which I consider very unacceptable, is that the contributor should create another user account on the computer with a name that does not use problematic characters, and contribute from that account. The problem with this is that you can have tooling you like to use that is tied to a specific user account, and a different user account would require a different license. There are other problems: all your configuration, settings, and additional programs are located on that user account, and trying to use a different one requires moving a lot of stuff over, which is a huge pain. Similarly, there is tooling that you install into a user-specific path rather than system-wide, either because that's its default installation method or because you are on a corporate machine where things are rather restricted by IT, so you have to do all of that again, or figure out how to symlink it; it's a pain. I really don't consider that an acceptable workaround, and there are actual programs that have failed because of this.

Similarly, there are some other interesting problems. One of these I encountered while doing stuff with Adlam, which is the script used for Fula, or Fulani: it can't be entered in Visual Studio. I'm not entirely sure why, because the scalar values just below it and just above it work fine, but any Adlam-specific scalar values cannot be entered in Visual Studio, which is weird. You also have other issues, and I have specifically put the matching one below the non-matching one, because the thing I found some people like to do is go "oh, of course it's matching the first one and not the second one, because you didn't set the global flag." No, I did, and you can clearly see here that it's matching the second one, not the first one; they look the same, but only this one works. Yeah, it gets interesting. And so here we have, I believe, Gulshan in Hindi. Now, Hindi is one of several languages written in Devanagari; I'm not entirely sure I pronounced that correctly, as I am not a speaker of Hindi or any Indic language, but hopefully I've got it right. Devanagari uses what is called an abugida. It is similar to a syllabary in that each symbol represents a consonant-vowel pair; where it differs is that there are combining marks that modify the vowel. So a specific syllable has a default vowel, and then you add a mark to change the vowel for that syllable, and Gulshan happens to have one of these. So when you go through the typical reversal thing, even if you convert to a normalization form, you get this problem: the reversal did not work correctly. Let's go back to this one. Hindi is, by most accounts I've seen, the fifth most commonly spoken language in the world. That's a major language. It is pretty important that we actually get this stuff right. Right? And we sometimes get this severely wrong. I didn't copy specific examples in here, but if you go looking for Unicode normalization problems related to Samba, you wind up seeing quite a few of them, and it's not because Samba is wrong, but because, as I mentioned, there are four different normalization forms. If you use a different normalization form on different machines, you wind up with problems. Sometimes these problems have caused data loss, which is bad. Right? We can agree that's bad. So you want to get this stuff right.
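Here is my rendering of the Devanagari example; I'm assuming गुलशन for "Gulshan", so treat the exact string as an assumption, but the failure mode it shows is the one described above.

```csharp
// "Gulshan" in Devanagari contains the vowel sign U+0941, which must combine
// with the consonant before it. Reversing the code points moves that vowel sign
// onto a different consonant entirely, and normalization can't help because
// there is no precomposed form to fall back on.
using System;

class DevanagariDemo
{
    static void Main()
    {
        // GA + vowel sign U + LA + SHA + NA
        string gulshan = "\u0917\u0941\u0932\u0936\u0928";

        char[] chars = gulshan.ToCharArray();  // all BMP code points, so no surrogate
        Array.Reverse(chars);                  // issues here -- and it is still wrong
        string reversed = new string(chars);

        Console.WriteLine(gulshan);   // गुलशन
        Console.WriteLine(reversed);  // the U vowel sign now modifies a different consonant
    }
}
```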
Now, anybody who's worked with very large systems, and I've talked to a few people about this, will pretty much agree that trying to maintain a single normalization form across an entire architecture would just be insanely difficult; you're probably not going to get that right. And what happens if you realize one normalization form wasn't right and you have to migrate to another? You have multiple servers that need to communicate with each other. You roll out a normalization change on one, and the others don't have it yet because you still need to roll it out on the other machines. So in that period of time where they have different normalization forms, what happens? Or do you bring down the entire system to do the updates? Then that's a total halt in business activity; that's not good either. You see why normalization forms don't really save you? They might in very small programs, but when architecting very large systems you start to see the problems.

So like I had said, I don't want to be a negative Nancy about all of this. I want to show you that things are being done, both by myself and by others, to help make this better overall. We are going to be talking specifically about solutions in the .NET space, although I have a few things that aren't programming-specific that are just useful. If you're interested in non-.NET language stuff, because this talk hasn't been strictly .NET specific, text affects everybody, there's contact information at the end; you can ask me about language-specific stuff and I can typically find it for you. So as I had mentioned, we are going to talk about Rune in System.Text. It's an API introduced by Microsoft; it's in .NET Core for sure, I believe introduced in .NET Core 3.0, which might mean it's not in .NET Standard yet. I have back-ported it, so if you are on .NET Standard 2.0 or later you can definitely use it, because I have back-ported it to at least that far back. As I mentioned, this covers a Unicode scalar value. It's not quite a grapheme, but it's a huge step in the right direction. The API is very much like char, which is wonderful because it substantially reduces the learning curve: pretty much everything works the way you want it to and the way you expect it to. The only noteworthy difference for something like this is that indexing is now done through GetRuneAt, where you pass the integer index, because you can't provide multiple indexers that vary only in return type, at least in C#; apparently you can do that in the IL itself, but you can't in C#. That's it for that one.

We will talk a little bit about Stringier, very minor, it's one of my projects; specifically I want to mention the Glyph data type. This does a very similar thing to Rune, and it is different from some of Microsoft's own work in this area. I'm not going to get into too much about how they are different, but know that I'm addressing some problems that theirs has. I've seen their StringInfo (or is it TextInfo?) text element enumerator, somewhere in one of their APIs, incorrectly declare that the French text it was going over was equal to some Gagnacahaca text, which should never, ever be true, because of how those scripts work. As far as I'm concerned, it's flawed.
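A minimal sketch of the Rune API described above, using the real GetRuneAt and EnumerateRunes members from System.Text in .NET Core 3.0+; the sample string is my own.

```csharp
// Rune works at the scalar-value level, so an astral-plane character comes back
// whole instead of as a meaningless surrogate half.
using System;
using System.Text;

class RuneDemo
{
    static void Main()
    {
        string s = "a\U0001F600b";                  // 'a', U+1F600, 'b'

        Console.WriteLine(s[1]);                    // a lone high surrogate -- meaningless by itself
        Rune r = Rune.GetRuneAt(s, 1);              // the full scalar value at that position
        Console.WriteLine(r.Value.ToString("X"));   // 1F600

        foreach (Rune rune in s.EnumerateRunes())   // 3 scalar values, not 4 code units
            Console.Write($"U+{rune.Value:X} ");
    }
}
```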
Glyph was partially designed to address quite a few of those issues. It represents a Unicode grapheme cluster, which is what we think we're working with, and I specifically designed the API to be as close to char and Rune as possible, because it decreases the learning curve. There's also icu.net. ICU, if you're a Linux or FreeBSD user, you're probably familiar with; it's a major reason why those platforms have fewer language issues than Windows does. I'm not sure why Microsoft isn't just using it instead of doing their own thing; they should probably stop that. But thanks to the folks over at SIL, you can use the ICU4C library through bindings, and that's where you can get it. I do highly, highly recommend using this over the stuff in the .NET runtime. My own stuff isn't using it just yet, although I am investigating exactly where I can tie into it and what the implications are, because it's a little funky using .NET and NuGet packages with native libraries. But I'm willing to deal with the complexities there, because ICU4C is a wonderful, wonderful library for this kind of stuff. There's also L10NSharp, again from SIL. It's an advanced localization library, and I'm generally going to recommend it flat out over what is done in .NET as well. SIL is a bunch of actual language people, versus Microsoft, who typically don't have dedicated language people doing their language stuff, so you can figure it winds up being quite a bit more sophisticated. Again, you can get it there. There's libpalaso, again from SIL. You should be noticing a general pattern here: it's not that I'm an SIL shill, it's that there's barely anybody solving these issues, which is unfortunate. Like I had said, I looked for about 300 hours trying to find stuff; there's not a lot. It's pretty bad. libpalaso does a lot; it's sort of their base class library for a lot of text-related things. I'm not going to go into everything that's in there in this presentation, but basically look through it real quick: chances are some common things that you have issues with are already solved and present in there, and you can get it there. You've got Keyman. This is really only going to be of interest if you are doing a lot of text entry in unusual languages: it provides keyboards for pretty much every single language you could possibly imagine, and you can get that there. It works on more than just Windows, if I remember correctly; again, SIL is pretty good about these kinds of things. ScriptSource is more of a large database, so if you need information about languages and scripts, I highly suggest getting it from there.
Andika is a font, and I'm actually going to recommend it to far more than just people working a lot with language. I've heard a lot of dyslexic people highly prefer this font, and even people with general eye strain tend to prefer it, because it was designed specifically to help with reading issues: it was designed to be very easy to read, with letters that are very easy to distinguish from each other, and it does a phenomenal job of that. I find it very easy on the eyes, so it's useful for non-language stuff too; it's truly great. But for language stuff specifically, it will help out a lot, especially if you're trying to proofread or test something in a language you do not read, because it will help you distinguish what you're looking at quite a bit better than you otherwise could.

And then there's Stringier. This is my big project, and it encompasses a lot of smaller projects. There's a core library, which is a large number of extensions, a back-port of Rune, and a few other things like some formatting helpers; it's sort of the base for the entire thing. That's how it largely got started, packaging up a bunch of extension methods for string to make the general experience better, and then I built quite a bit on top of that. There's also a very high-performance, very lightweight parser framework, lightweight in the sense that the parsing is allocation-less; as far as I can tell, it's the only parser framework for .NET with an allocation-less parser. Its general performance is very competitive, top 25 percent in almost every case, and it generally comes out as the leader or tied for the lead. The approach it takes is novel. I'm not going to get into different parsing framework approaches; that's a specific subject I'm not diving into here, but if you want to know more about it you can hit me up; there's contact information at the very end. What I'm working on right now is a stream wrapper meant to address a number of concerns and criticisms I have about the design of TextReader, TextWriter, and the streams. If you want an idea of what my criticisms are, and this is only a small glimpse into it: take a stream, set up a TextReader for it, read a bit, and then seek the position of the stream back to the beginning to reread what you have read, and you will see what the problem is. It breaks spectacularly, and it is because of a design problem that they didn't fully test, and it's not fixable within that design, because the buffer is located in the wrong place. One more sub-project: there's what is essentially more extension methods, but they're literary-specific, so correctly determining a palindrome, but also things like what a lipogram or heterogram or pangram is. There's a framework inside that library to help get the specific orthographies for a language, because it's actually not that simple. English, for example, can be written in Latin script, but also Shavian and Deseret, and that's pretty common across languages; there are often multiple scripts. I'm trying not to fall into the pitfalls of those who came before me, where assumptions were made that wind up being wrong, so the entire thing is largely database-driven: you specify a language and a script, you get out an orthography, and then the algorithm to correctly determine, say, a lipogram uses that fetched information to actually do it, regardless of what the language is. So it's very extensible; you don't even really need to know programming to add your language to it, which is wonderful.
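To make that data-driven idea concrete, here is my own small illustration of a lipogram check that takes the alphabet as data rather than hard-coding English; this is not Stringier's actual API, just a sketch of the approach described.

```csharp
// A lipogram deliberately omits one or more letters of its language's alphabet.
// The alphabet is passed in as data (one entry per grapheme cluster), so the
// same code works for whatever orthography someone supplies.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class LipogramSketch
{
    public static IEnumerable<string> MissingLetters(string text, IEnumerable<string> alphabet)
    {
        var seen = new HashSet<string>();
        var e = StringInfo.GetTextElementEnumerator(text.ToUpperInvariant());
        while (e.MoveNext())
            seen.Add((string)e.Current);                    // grapheme clusters actually used
        return alphabet.Where(letter => !seen.Contains(letter.ToUpperInvariant()));
    }

    static void Main()
    {
        string[] latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".Select(c => c.ToString()).ToArray();
        var missing = MissingLetters("The quick brown fox jumps over the lazy dog", latin);
        Console.WriteLine(string.Join(", ", missing));      // nothing missing -- a pangram, not a lipogram
    }
}
```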
It also means that the algorithms are not assuming you're working with English specifically: if somebody were to add in, say, Hindi in Devanagari, it would just work, which is wonderful. So, in closing: I am probably working at the time of this presentation. I think I am the one oddball who can't actually make it, because I'm helping out in a pandemic, yay, so I am not going to be available for questions. However, I would like to give special thanks to Gulshan Saini and Jeremiah Breeden: Gulshan for helping me with the Hindi material, because I am not familiar with Hindi, and Jeremiah for largely being a sounding board over this entire thing. He knows almost nothing about other languages and almost nothing about text processing, so he was very useful for making sure I explain these things in a way that makes sense and doesn't assume you already know it. If you're interested in contacting me, one of the best ways is Twitter: you can tweet me at pkl7. You can also check out my GitHub; it's not just text, there are a few other things, but by far Stringier is my major project. So hopefully you've learned that text processing is a bit more complicated than people give it credit for, but are leaving optimistic about how people are tackling it. Everybody works with text; let's make it better. Stay safe, guys.