Thank you. So I'm Rob Tompkins. I'm from Richmond, Virginia. I probably should go to my "who is this guy" slide, but I changed it a little bit. So, who I am: I'm chtompki, which is an artifact of my Virginia Tech email address, the first two letters of my first name and the first six of my last name. My first name is actually Christopher, if you're trying to find me. I'm a committer on Apache Commons, and I'm the release manager for Commons Text. I do software development, Java and DevOps. I tend to end up in the DevOps space because I end up doing the work that nobody wants to do, and that gets the stuff out the door. And I figured I'd put on here that I'm a mathematician and a logician; that's what I did in school, so why not? Oh, and pardon the minimal slide design. I kind of like minimal slide designs, but if that puts you guys to sleep, it wouldn't be the first time people fell asleep while I was talking. So, introducing Commons Text. I've got two goals here for what we want to do with Commons Text. Let me see, is this the goals slide? Yeah, this is the goals slide. The first goal is to introduce a standardized set of text processing algorithms and libraries for reuse across Apache projects. The big goal here is reuse. It's kind of open ended how complex our algorithms get and whether or not we get into natural language processing and stuff like that, because we already have a top-level project for natural language processing, namely OpenNLP. And that pulls me into the second goal: to remove some of the heavier, textier sort of things from Commons Lang, so that Commons Lang stays a relatively all-inclusive, but minimally all-inclusive, library for any Java developer. We want to give them a very, very solid set of tools that doesn't include the kitchen sink and a couple of sports cars. So what's the history of Commons Text?
In October of 2014, I think, Bruno brought to the dev list an appetite for including the Levenshtein edit distance in Commons Lang. If you go out there and look, it's under LANG-591, I think that's the Jira issue. And I think the community decided that was too complex for Commons Lang; it doesn't really fit into that space where it's going to be arbitrarily useful for any Java developer. So Bruno and Benedict put together a proposal to create a sandbox component and did substantive development over two years, to where I kind of jumped in last fall. I was fortunate enough to have a really solid code base to work from, because these guys did so much good work, so I picked up right where they left off, and by March of this year we had our 1.0. So what's the current layout? Well, fortunately, if you're familiar with Lang, a lot of the code base is from Lang. So if you've seen it before, pardon, but we'll do a little bit of looking at it. The current layout for Text is things that are textier than StringUtils, specifically the stuff from the text package in Lang, in hopes of deprecating that stuff and altogether removing it in the 4.0 version of Lang. So StrBuilder, FormattableUtils, StrSubstitutor, StrTokenizer: these are all things we've included in the code base, along with some extra stuff. We've got some diff utilities under a diff package, and we've got some string similarity and edit distance utilities. That brings me to the distinction between a similarity and an edit distance. A similarity is a number that indicates whether two strings are the same or not, but it doesn't conform to the mathematical definition of a distance, meaning that if you have three points on a plane, they either form a triangle or a straight line. It's called the triangle inequality; if you're not familiar with it, you're welcome to look it up.
If you could set it up so that one of the legs was longer than the sum of the other two sides, it wouldn't be a triangle. Anyway, we've also got some translation stuff, specifically for escaping all types of text formats: XML, CSV, JSON, Java. There's a bunch of different escape utilities that we have, and the translation package supports something at the top-level package of text that does string escaping. So let's look at the code we brought over from Lang. StrBuilder is an alternative to java.lang.StringBuilder which provides better instance methods, offering more mutability at the string level. It loses its thread safety, so it's worth knowing that, but it does afford you a lot more subtle mechanics around building a string. Let's look at some examples. We're building it up with the string "test", and we can read from a Readable as well as append with all the standard StringBuilder methods you would find. Pardon the reuse of the variable here, but we can new up another one and replace all the B's in this string with a different string. We can also replace, beginning from an index, with a given string, and have the string continue from there. So we have a bunch of options that java.lang.StringBuilder doesn't necessarily afford us. FormattableUtils affords you the ability to do justification and things like that. So if I've got a string and I want to left justify it or right justify it, and specify what characters I want to use as padding, I can do that. It provides control over the Formatter, and it gives us control over how we want to pad the string on either side.
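The justification idea can be sketched in plain Java. This is an illustrative stand-in, not the library's implementation; the class and method names here are made up for the example.

```java
// Plain-Java sketch of left/right justification with a pad character,
// the kind of thing FormattableUtils handles via a Formatter.
public class JustifySketch {
    static String justify(String s, int width, char pad, boolean leftJustify) {
        if (s.length() >= width) return s;        // already wide enough
        StringBuilder padding = new StringBuilder();
        for (int i = s.length(); i < width; i++) padding.append(pad);
        return leftJustify ? s + padding : padding + s;
    }

    public static void main(String[] args) {
        System.out.println(justify("foo", 6, '*', true));  // foo***
        System.out.println(justify("foo", 6, ' ', false)); //    foo
    }
}
```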
So if we look at this, we have a string "foo" that we want to left justify, we want it to be six long, and we don't want a maximum here. I've got the signature written down: the CharSequence we want to justify; the Formatter we're going to use, so we can pass a Formatter around if we want; the type of justification we want to do; the minimum length of the output string; the desired maximum of the output string, where passing in negative one, obviously, or maybe not obviously, accommodates arbitrary maximums; and the character with which we want to pad it. The output of the first would be "foo***", and the output of the second would be "foo". And, oh, I thought I remembered that comma shouldn't be there; pardon the comma, clearly that would not compile. If you don't pass in a padding character, you just get spaces. The next one, which Benedict will actually talk about later, but I changed my example so we don't use the same one, is essentially a templating engine. StrSubstitutor accommodates using dollar-sign-curly-brace variables and a map to do variable replacement in a string, which is convenient; at my day job we use this for doing content replacement in the UI, actually. So let's look at this. If we've got our dollar-sign variables in a string, and we have a values map that we pass in during instantiation, then when we do the replacement, the keys in the map get replaced with their values, and our output becomes... and yet again, another typographic error: I left out the period. Having typographic errors in a presentation about text processing feels really, really bad. You can also use arbitrary maps in the instantiation of this; you can use the operating system's environment map and different things like that to populate it.
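The ${name}-style replacement just described can be sketched in a few lines of plain Java. This is a minimal illustration of the idea, not StrSubstitutor itself; the real class also handles nesting, escapes, and custom variable prefixes, and the names below are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of ${name}-style variable substitution from a map.
public class SubstitutionSketch {
    static String substitute(String template, Map<String, String> values) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            int start = template.indexOf("${", i);
            if (start < 0) { out.append(template.substring(i)); break; }
            int end = template.indexOf('}', start);
            if (end < 0) { out.append(template.substring(i)); break; }
            out.append(template, i, start);                  // copy literal text
            String key = template.substring(start + 2, end); // name inside ${...}
            out.append(values.getOrDefault(key, "${" + key + "}"));
            i = end + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("animal", "quick brown fox");
        values.put("target", "lazy dog");
        System.out.println(substitute("The ${animal} jumps over the ${target}.", values));
        // The quick brown fox jumps over the lazy dog.
    }
}
```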
And I believe you can use different variable syntaxes, but for the sake of simple examples we've stuck with this one. StrTokenizer is a generalization of comma-separated-value parsing, so it accommodates delimiters and quote characters and even ignored characters. It's similar to java.util.StringTokenizer, with more flexibility, and we've implemented the ListIterator interface. So let's look at our example on that one. Our goal is to parse this string using semicolon as the delimiter, with the quote character being the standard double quote. This is taken directly from our unit tests; I actually simplified it a little bit. One would expect, because we have quotes around the value here, that we'd get the whole quoted value back as a single token, obviously. And setIgnoredMatcher in this case says that if I'm dealing with a space character, I want to trim it down and have it be represented as the empty string coming out. And setIgnoreEmptyTokens, as used here, simply says that if I have the empty string as a value in the array, I want it to remain in the array. So we end up with what we would expect out of that, which is a, b, c, d, ";e", f, and then three empty strings. And again, I'm real good at typographic errors, if you guys didn't notice. This is one of the benefits of open source work: there's generally another set of eyes on what you're working on, so you're less likely to end up with typographic errors like this. Exactly, exactly. And this is an open source project; you guys are welcome. The slide deck itself is actually on the web, so if you want to see it... come on, go back to the beginning... it's actually at this URL. So if you want to follow along, you're welcome to. You can also find it under chtompki. So: we did FormattableUtils, we did StrSubstitutor, we did StrTokenizer. Next, StringEscapeUtils.
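The delimiter-plus-quote tokenizing just described can be sketched in plain Java. This is a rough illustration of the concept only; StrTokenizer itself also supports ignored-character matchers and empty-token handling, and the names below are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of delimiter-plus-quote tokenizing in the spirit of StrTokenizer.
public class TokenizeSketch {
    static List<String> tokenize(String input, char delim, char quote) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : input.toCharArray()) {
            if (c == quote) {
                inQuotes = !inQuotes;           // toggle quoted mode, drop the quote itself
            } else if (c == delim && !inQuotes) {
                tokens.add(current.toString()); // an unquoted delimiter ends the token
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // A quoted field may contain the delimiter without ending the token.
        System.out.println(tokenize("a;b;\"c;d\";e", ';', '"'));
        // [a, b, c;d, e]
    }
}
```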
So this is that whole translation bit I was talking about, the package dedicated to translation: getting to and from different escapes is what we're going for here. We can do JSON escaping, where our quote character needs to be escaped because quotes are used to delimit values in JSON. Or, in this case, we can do Java escaping, where we need extra backslashes on each backslash. It's worth noting that there was said to be a vulnerability in the ECMAScript escape routine, in that if you include the ECMAScript-escaped text inline in HTML, it may render the HTML invalid. If people need that level of security, they should probably go to a library that specifically works in the security space; the Javadoc actually points to one in particular that was suggested in the issue. We closed the issue and moved on. But the purpose of the library is very specifically to escape JSON, or JavaScript, or ECMAScript for that matter, so one has to be careful when using it outside of those contexts. Now that we've gone through the stuff that was in Lang, let's go over to some functionality unique to Text. And I'm realizing that I'm way ahead of schedule. Oh, well. We can talk afterwards about where we think the library should go. The big bit of unique functionality is in two places. One is code that I'm not particularly familiar with, though I've spent a little time in it, and that's the diff area. The other, which I've actually spent considerable time in, is the similarity score and edit distance space. Something we implemented there is the longest common subsequence algorithm, which is a convenient algorithm for determining similarity between words or strings.
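The backslash-doubling idea behind Java-source escaping can be sketched quickly. This tiny example covers only a handful of characters; StringEscapeUtils covers control and Unicode characters far more completely, and the names here are made up for the illustration.

```java
// Tiny sketch of Java-source-style escaping: backslashes double,
// quotes and a few control characters get a backslash prefix.
public class EscapeSketch {
    static String escapeJavaLite(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break; // each backslash becomes two
                case '"':  out.append("\\\""); break; // quote gets a backslash
                case '\n': out.append("\\n");  break; // newline becomes visible \n
                case '\t': out.append("\\t");  break; // tab becomes visible \t
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeJavaLite("path\\to\t\"file\"\n"));
        // path\\to\t\"file\"\n
    }
}
```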
I use "words" and "strings" interchangeably because in mathematics and logic, the elements of the Kleene closure over an alphabet are called words, and in computer science they're called strings. But if we look at this, one might wonder how ABBA and ABAB have a longest common subsequence of three, right? Because one might think, well, clearly the longest common substring is AB. But the definition of subsequence in this case is any combination of characters moving from left to right, with some characters removed; they need not be adjacent. So for this, ABB on the left-hand side, and on the right-hand side AB plus the subsequent B, count as a common subsequence, and the answer for this one is three. The next one's pretty straightforward: frog and fog, right? Just remove the R and it's the same. Pennsylvania and some contrivance of the word Pennsylvania have a common subsequence of 11, and elephant and hippo have a common subsequence of one, whether it's the H or the P. It's also worth noting that the longest common subsequence algorithm is particularly slow, in that the fastest we can do it is on the order of the size of the left-hand word times the size of the right-hand word. And if you try to do longest common subsequence across an arbitrary number of words, you start bumping up into NP-complete territory. So we've chosen to limit ourselves to two words, just to keep people from bumping themselves into the NP-complete space. I haven't thought about generalizing this to an arbitrary number of words. It's not unreasonable to think we could do something like that; it kind of feels like it still sits within the space of the library. Anyway, that's the longest common subsequence. If we take that and normalize it, normalization in the sense of creating a distance out of it, it becomes: how do we take a subsequence and impose characters to get from one string to the next string.
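The numbers above can be checked with the standard dynamic program for the longest common subsequence, which runs in the O(m·n) time the talk mentions. This is a sketch of the well-known algorithm, not the library's actual class; the names here are illustrative.

```java
// Standard dynamic-programming longest-common-subsequence length, O(m*n) time.
public class LcsSketch {
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;  // matching chars extend the subsequence
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]); // drop one char
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(lcsLength("ABBA", "ABAB"));      // 3
        System.out.println(lcsLength("frog", "fog"));       // 3
        System.out.println(lcsLength("elephant", "hippo")); // 1
    }
}
```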
And that gives us a type of edit distance, in that the way you're allowed to edit is what drives the output of your distance metric. If you accommodate substitutions, that's one type of editing; whereas if you pluck a subsequence and insert characters, then pluck another subsequence and insert characters, it's not quite substitution. Anyway, it's pretty clear that we only have to do two edits to get from ABBA to ABAB, and one edit to get from frog to fog. I don't know offhand what the three edits are to get from Pennsylvania to the contrivance on the right, and I haven't done the exercise of getting from elephant to hippo. But do know that we're moving from left to right, so the P and H over here won't necessarily fit together appropriately. Another one we've got is the Levenshtein distance. If we look at this and compare it with the results of the other one, we notice everything's the same except that it's fewer edits from hippo to elephant. My guess, and I'm less familiar with the Levenshtein algorithm off the top of my head, is that it wouldn't surprise me if this is because a subsequence is necessarily left to right, while the Levenshtein distance allows substitutions anywhere. So those are two of our edit distances. We've got a bunch more; I actually have that on the next slide. So what else is there? We've got a variety of diff tools under text.diff. The diff algorithm in there is largely based on the longest common subsequence, and I believe it's the Myers algorithm on the longest common subsequence. I should put in a reference; regardless, it's in the Javadoc, so if you want to get into that, dig into the Javadoc. There's good stuff in there. We have various other similarity scores and distance tools: we've got cosine similarity, the Hamming distance, the Jaccard distance, and the Jaro-Winkler. Pardon my mispronunciation on that.
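The Levenshtein distance just mentioned is the classic insert/delete/substitute edit distance, and its textbook dynamic program is short enough to show. This is a sketch of the well-known algorithm, not the library's implementation; the names are illustrative.

```java
// Classic Levenshtein edit distance (insert, delete, substitute; each cost 1),
// computed with the usual O(m*n) dynamic program.
public class LevenshteinSketch {
    static int distance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i; // delete everything
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j; // insert everything
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(dp[i - 1][j - 1] + cost,           // substitute or match
                           Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1)); // delete / insert
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("frog", "fog"));       // 1
        System.out.println(distance("ABBA", "ABAB"));      // 2
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```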
When I first saw the class named Jaro-Winkler, I was like, the Winkler? Anyway, so it's Jaro-Winkler. And we've got a bunch of translate stuff that mainly supports StringEscapeUtils, but there's more in there than just that. So the question is, what's next for Commons Text? With the idea that we're trying to do considerable deprecation of the textier things in Lang, we'd probably want to move over WordUtils, minimally, and I don't know how much more we could move over. The boundary I see between Text and Lang, and I brought this up on the mailing list, I guess sometime in the fall, is this: if I'm a Java developer working on, let's say, I don't know, an Android app, I'm probably going to want Lang and I'm probably not going to want Text. If I'm doing natural language processing in Java, then I probably would want Text, or something like that. I'm trying to find that natural boundary between the two, where if I'm actually focused on doing work in text manipulation, then Commons Text is something I'd want to include. Whereas with Lang, I want everybody to be like, yeah, I want that one, just because it's really, really nice not to have to write isBlank everywhere. Because it's an ugly if statement to have, you know, a null check and then isEmpty. So that's kind of where I see the line between those. So I really want to get a 1.1 out in the next month. We talked about a 1.1; we pulled some stuff out of the 1.0 release that was potentially contentious, in the sense that people were in the midst of discussions about design and whatnot. And we said, okay, well, if people are talking about the design of these components, we can just pull them and set them aside for the time being and roll out 1.0 with what we've got, so that we can actively deprecate that stuff in Lang.
And then we can pull that stuff back in, because it's a non-breaking API change, and provide more tools, and that's cool. The codebase now has those two things in it, specifically WordUtils and a RandomStringGenerator, so rolling a 1.1 isn't unreasonable at this point. In fact, I probably could have done it last week or the week before, had I not been underwater at work, but that's the nature of having a day job. And I'm assuming that we don't have any bugs in 1.0 and won't have to roll a 1.0.1. So that's what I've got. We've got WordUtils coming in, with some updates to it. Can I remember what the update is? No, I can't. It has to do with re-adding a method that exists in a couple of different places; it's in StringUtils, but the mechanics of it are more wordy in nature, and the implementation is vastly different from what's in StringUtils, so having it in WordUtils isn't unreasonable. We could pull up the pull request if we wanted to look at it after this. The other thing is the RandomStringGenerator, and a lot of thanks go out to the Commons RNG crew for putting that in there. They've been doing a whole lot of solid work in the random space and the probabilistic space with the Commons RNG component, and I guess the forthcoming Commons Numbers component and some of that stuff. I kind of wade over into the math territory a little, having come from a mathematical background; my research in math was in the combinatorics-on-words space, on functions that map elements of one Kleene closure to another Kleene closure and how to avoid patterns in that space. Anyway, the next thing on the list is to deprecate the stuff that's now in Text for the upcoming Lang release, which we may or may not be fast enough for, depending on whether Benedict decides to run with that release sooner rather than later.
I think some of the deprecations have been done, thanks to Pascal Schumacher, but I'd have to get into Lang and look a little more carefully. So, like I said, I'm running fast; this is all I've got. Do you guys have any questions? Some thoughts that occur to me: I've been doing a lot of work out of the M. Lothaire book, Applied Combinatorics on Words, and taking some of the more fundamental principles out of that book and writing them into the code base would be reasonable, so that OpenNLP and other textier sorts of applications, or bioinformatics applications, have a common place to go. A lot of these distance functions aren't really easily findable out there in the Java environment, and a lot of people are probably implementing them themselves. So the goal is to have a common place for that, and I tend to fall back on that book, and maybe a little of the work out of the University of Waterloo; they've got a pretty solid combinatorics-on-words crew that comes up with interesting stuff. That's kind of where I see maybe the 1.2 or the 2.0 of Commons Text going. But do you guys have any questions on what we have here? [Audience question.] Yeah, it seems reasonable. No, that seems quite reasonable. I mean, I suppose you would need something that implements the Comparable interface, or something along those lines, where you can say, okay, well, this is a list of items that we know how to compare. But aside from that, the mechanics of going through one of these metrics is elemental, in the sense that you're doing it element-wise. So no matter what the alphabet is, so to speak, you're still operating on individual characters of the alphabet, and so it's easily generalizable to whatever has that mechanism. That seems quite reasonable. [Audience question.] Is there anything wrong with OpenNLP?
I'm assuming he's been quite busy lately, because he's been only mildly participating here. The last couple of pull requests that have come in, he's at least been looking at them, but he hasn't been actively contributing code on a regular basis. I mean, that's one of the drawbacks of being in the open source world: you never know what other people's timelines are, so it's all just a slow game. But we have at least one committer out there. It wouldn't be unreasonable for me to start treading over into that space; I haven't done that yet. Personally, I'm fairly new to the whole open source world. I hadn't made any open source commits prior to March of last year, or something like that, so I don't know. I enjoy it, so there's no reason to think I shouldn't tread over into that space. [Audience question.] Sure, sure. It seems to me like the best thing to do would be to find Apache projects that have the Levenshtein distance implemented in them, or something like the Jaro-Winkler or the Hamming distance or what have you, and maybe open conversations in those projects saying, hey guys, can we standardize on a location here? Then maybe we can build the community, in the sense of them thinking: okay, all these text processing algorithms where we've reinvented the wheel in eight different locations, how can we centralize that? That's probably a conversation worth starting, and it's not unreasonable to figure that out.
I know that, I feel like, Spark has something like that in it, and a couple of other projects have these implementations out there. Even the Jaccard distance had a bug where the code was pointing at an implementation of the Jaccard distance in another Apache project that had since been fixed, and it's like, does this make sense, guys? But I mean, I suppose that's just the slow game of open source development. I don't know how many people are on the Commons mailing list, but hop into the dev list; there's no reason that we shouldn't start that, and I'm all for it. And if you think of fundamental algorithms in the text processing space that seem like they could be widely reused, there's no reason not to open a Jira issue and just commit the code. Everybody: Benedict! So lots of thanks, lots of thanks go to Benedict and Bruno, whom I don't think is at the conference, for getting the component as far as it was when I jumped in in November. I think that the fundamental algorithms, if we don't have them, are definitely welcome. Sure, sure. Yeah, that seems quite reasonable; get with me after this, and we can figure out where that stuff is and start moving on it. [Audience question: what about different languages, which have different characters? In some, for example, a word is just one symbol and not made up of characters. Does this make sense there at all?] They do, and that opens us into surrogate pair territory. I don't know how many people are familiar with this, but Java strings are UTF-16 under the hood: code points up to 65,535 or so, the Basic Multilingual Plane, are represented as individual char values, and as soon as you cross past that, characters start being represented as pairs of chars called surrogate pairs. Like, if you're trying to represent an emoji, an emoji is two chars under the hood. And so, fortunately, the Java API
affords the ability to predicate on surrogate pairs, with Character.isSurrogatePair and friends, and you can do counting of characters, or movement across characters, that form surrogate pairs. I think that affords us the ability to operate in other languages. It still gets pretty subtle when you start thinking about "the last character of" and things like that, when you're trying to get that out of a string. I worked on a pull request in Lang earlier this year where we were dealing with exactly that, and we decided to go with what the Java String API does, which takes a surrogate pair and counts it as one; that's pretty standard, so that's where we settled. I tend to defer to the Java implementation of an algorithm where one exists, because that way we're not doing two different things and getting different results. But there are some subtle mechanics under the hood there when you're in that character space. We've technically got another five minutes. [Audience comment.] Yeah, that doesn't seem unreasonable, whether it's a UTF-8 character or, whatever, Wingdings. That all seems quite reasonable for the library. And I mean, we're definitely in a fledgling sort of space right now, in that we've just come out with a 1.0 and we're trying to focus on textier things than Lang, so I'm pretty open as to what should be here; I'm not real prescriptive about it. I don't know, the more word we get out and the more people get interested in it, the easier it is to kind of say, oh, that's the direction we should go, because that's where we're going. That tends to be my philosophy on things generally: okay, well, what's on the plate now? I've got these things, I'll do those things, I'll get a release out the door, and if we can start moving in a direction,
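The surrogate-pair point can be demonstrated with nothing but the JDK: an emoji outside the Basic Multilingual Plane is a single code point but two Java chars, and the standard String and Character APIs count accordingly.

```java
// Plain-JDK demonstration of surrogate pairs: one code point, two chars.
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', the U+1F600 emoji as a surrogate pair, 'b'
        System.out.println(s.length());                             // 4 chars
        System.out.println(s.codePointCount(0, s.length()));        // 3 code points
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isSurrogatePair(s.charAt(1), s.charAt(2))); // true
    }
}
```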
regardless of the direction, I'm okay with it. [Audience comment.] That's really hard to do, yeah. [Audience question.] Yeah, that opens the question of whether that should be in this library, or whether we should try to work on it in OpenNLP, and where the line is between the two. I don't think we've actually ever gotten to that boundary yet. I'm open to those discussions, and there's no reason that we shouldn't have them; we just haven't gotten to the place where something seems so complex that it should go in the natural language processing toolbox as opposed to our toolbox. But there's no reason to say we shouldn't start thinking about those sorts of things. I think I'm going to let the floor go to the next guy so that I'm not crowding the microphone. So, thanks, everybody; I appreciate it. Feel free to grab me in here or outside after this, and we can keep plugging away at the new library. Thanks.