Thank you. Sorry about the default Google Slides font; I didn't mess with it too much. I'm Charles Nutter, @headius on most services online, based in Minneapolis, Minnesota, so this is a lovely warm day for me. I've been at Red Hat for about four years now, working full-time on JRuby, and I've been working on JRuby since 2006. A quick thank-you to Sun Microsystems, rest in peace, Engine Yard, and Red Hat for keeping the JRuby dream alive all these years.

Okay, so here's the agenda. I don't have a lot of time, so we'll just jump right in.

So, strings in Java. I've been doing Java since the beginning, since 1.0. A string has always been a character array and a length to go along with it, with other bits and bobs in there sometimes, like cached hash codes and such. In the beginning, UCS-2 was fine. It was a totally satisfactory representation, because 16 bits per character should be enough for all the characters in the world, right? Well, it turns out that's not quite true. So later on it was moved to UTF-16. We keep the 16-bit characters, but we sacrifice constant-time random access for certain characters in the higher planes, like emoji. Generally, though, we still have constant access time, and characters still fit in the typical character size.

Now, one of the biggest problems that comes up for a language like JRuby, which has a very different sort of string, is that String itself is pervasive throughout Java APIs. We got CharSequence too late; it was an underpowered interface, and nobody really uses it. So we either have to fit into String APIs, or, if we use our own string, we have to roll entirely new versions of those APIs that don't require a Java String. That's a difficult problem for us.

Recently there's been work to improve the packing of ASCII bytes, seven-bit ASCII: in OpenJDK 9, String will actually be able to represent itself as a compact array of bytes under the covers, so you don't have that waste of eight or nine bits for every single character. And there's a possibility, I've heard, that we could even get UTF-8 in there. It would be opt-in, because of the expectation of constant-time random access, but I would turn it on, because I would love to be able to pull UTF-8 bytes off the wire and not do any transcoding at all.

And that's pretty much it; this is strings in Java. They're going to be UTF-16, and if that doesn't work for you, you're just kind of stuck.

Okay, so, problems here. First, the constant encoding overhead: any time you read from the wire or the file system and deal with characters, you pay at least the decoding cost of turning bytes into characters, because everything has to become UTF-16. Vanishingly small amounts of IO on real networks and in actual files are in UTF-16, so you pretty much always have to suffer through this, and it's not cheap. And then, most of the time, once you've processed those characters you're going to send them back out to the wire, and you pay the encoding cost all over again.
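To make that cost concrete, here's a minimal round-trip using only standard JDK APIs: every inbound byte is inspected and widened into UTF-16 chars on decode, and the whole walk happens again on encode.

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] wire = "payload from the network".getBytes(StandardCharsets.UTF_8);

        // Decoding: every byte is validated and widened into UTF-16 chars.
        String s = new String(wire, StandardCharsets.UTF_8);

        // ... process the characters here ...

        // Encoding: the entire string is walked and transcoded again on the way out.
        byte[] out = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(out.length + " bytes back out");
    }
}
```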
So it really does limit how we can do high-speed IO, for anything where you need to process characters.

Like I mentioned, there's the ASCII waste: if you're only using the bottom seven bits for ASCII characters, you're wasting at least a full byte for every character. Early on, that was one of the things the Java haters really latched onto: every string is now twice as large, even though I don't need it to be twice as large. It's kind of funny that we had a whole stack of super-compact bytecodes to try to fit more bytecode into our set-top boxes in 1995, and yet it's the UTF-16 strings that waste way more space than those bytecodes ever would. Interesting decisions there.

We also have to deal with the fact that in Ruby, binary data is represented in the same structure. A Ruby string is basically just a wrapper around bytes, and it might be binary, it might be characters. That's really hard to do with Java strings: Java wants those characters to be valid, and if you start shoving bogus characters in there, all sorts of weird things break. You might wonder why we'd want to do this at all. Well, we have various cases where we want to embed arbitrary binary data into a class file, into the constant pool, and there's no good way to do that other than forcing it into a string, probably an invalid string, and shoving it into the class file itself. So that's another issue we've had to deal with.

And then there's the CJK problem, the Chinese/Japanese/Korean problem. Each of these languages has its own representation of certain characters, like kanji in Japanese and hanzi in Chinese. The Unicode folks decided that the character that looks one way in Japanese is the same as the one that looks pretty much the same way in Chinese, so they get a single code point. As you can imagine, people weren't happy about this, because now, when you round-trip through any Unicode encoding, you lose which of the two versions it actually was; you can't take it back out and recover the original character. In their estimation, that's a serious problem, and it's why Japan still uses Shift JIS and China still uses Big5 and other Chinese encodings.

So, strings in Ruby. A Ruby string is basically just a byte array and a length, similar to the character array and length in Java, except the length here refers to the byte length, not the character count. Up through Ruby 1.8.7 there was a single global encoding that you could set at the command line. I think it defaulted to assuming ASCII, but you could specify that all strings were Unicode, and then it would do some additional character validation, or you could specify other encodings via switches. And, like I say, some Unicode-specific operations were always available; regular expressions, for example, have a flag that says to treat a pattern as a Unicode regular expression rather than an ASCII one. That worked okay, but in a world of many encodings, dealing with the wider internet and lots of files in different formats, they decided they needed something better for negotiating all these different encodings. So from Ruby 1.9.1 on, in addition to the byte array and the byte length, every string carries an arbitrary encoding. That might be UTF-8, might be UTF-16, might be Big5, might be EBCDIC; there are a lot of different encodings. They opted to let each string decide its own encoding, so that when we do have to deal with disparate sources of data, we don't have to transcode everything to some intermediate form; we can have mixed-encoding strings throughout the system.
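As a rough sketch, the layout being described looks something like this. This is a hypothetical class purely for illustration; the real implementations (CRuby's RString, JRuby's ByteList) carry more bookkeeping, such as validity caches and copy-on-write sharing.

```java
import org.jcodings.Encoding;

// Hypothetical sketch of the Ruby 1.9+ string layout described above:
// raw bytes, a byte length, and a per-string encoding tag.
final class RubyStyleString {
    byte[] bytes;      // raw storage; may be character data or arbitrary binary
    int byteLength;    // length in bytes, NOT characters
    Encoding encoding; // e.g. UTF-8, Shift_JIS, Big5, EBCDIC, ...
}
```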
It's obviously very complicated. We had to implement all of this too, to keep up with Ruby features, and it's taken years for us to reach the level of compatibility we needed with CRuby, but it works surprisingly well. Most of the time you don't have a mixed-encoding environment; usually it's just UTF-8 you're dealing with. But if you do have mixed encodings, it negotiates them pretty well. So it's a complex design, some would say over-complex, but it works surprisingly well.

So, problems with Ruby's "multilingualization," as they call it. By default everything is UTF-8, so you have the standard UTF-8 problem of not having constant-time random access. You can't just say "give me the nth character" and have it immediately; it has to be a walk of all the characters, because they're variable-width. If that's a problem for you, however, there's support for UTF-32, so you can say: internally I want to use UTF-32. Then it's all constant access time, and it works just fine that way too. And of course you have to pay the encoding and decoding costs.

Like I say, it's possible to have an arbitrary number of encodings floating around in the system, but it's rarely done, and it's almost never a problem. Usually you're dealing with one encoding, or you actually convert everything into UTF-8 and use that internally, so it doesn't come up too often. But if you do have strings with different encodings, every time you do a string operation involving another string, you've got to find some common ground. There are various heuristics Ruby uses: if it's UTF-8 over here and Shift JIS over there, it negotiates to whatever the best common encoding for the two is, and your resulting string will be in one of those encodings. Most of the time you never even need to look at this; the Ruby string subsystem hides the fact that there's a byte array and an encoding underneath, and you just deal with code points and characters.

The bigger issue is that all of the supporting libraries have to handle this too. As far as I know, there's only one regular expression engine in the world that can work over arbitrarily encoded bytes, and that's the one Ruby imported, called Oniguruma, which we ported for JRuby. And then IO needs a much more complicated pipeline for reading in bytes, converting them to some internal encoding, and going back out, because we handle an arbitrary number of encodings. Like I said, it's complex, and the early implementations were fragile, but it has matured pretty well, and things generally just work.
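To give a feel for the encoding negotiation heuristics mentioned above, here's a hypothetical, heavily simplified sketch. The method name and logic are illustrative only; the real CRuby/JRuby compatibility rules handle many more cases. Only `isAsciiCompatible()` is an actual JCodings API here.

```java
import org.jcodings.Encoding;

final class EncodingNegotiation {
    // Hypothetical helper: pick a common encoding for an operation on two strings.
    // Returns null where Ruby would raise an encoding compatibility error.
    static Encoding negotiate(Encoding a, boolean aIs7BitOnly,
                              Encoding b, boolean bIs7BitOnly) {
        if (a == b) return a;                 // same encoding: nothing to negotiate
        // A string whose bytes are all 7-bit ASCII fits any ASCII-compatible encoding.
        if (aIs7BitOnly && b.isAsciiCompatible()) return b;
        if (bIs7BitOnly && a.isAsciiCompatible()) return a;
        return null;                          // genuinely mixed: incompatible
    }
}
```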
How are we doing here? Okay.

So, strings in JRuby. Prior to 2006 we had a Java String-based, character-based implementation, but obviously all these different encodings were a problem, and representing binary data was a problem. We realized we had to follow the Ruby approach more closely. Unfortunately, that meant a lot more work, and up until maybe even last year we were still working out bugs in our implementation of multilingualization. There was character logic that had to be duplicated, we had to port that regular expression engine over, and all of the IO encoding and transcoding logic had to be ported over. It's been years of work, but it works now. We have very few or no known bugs compared to CRuby in our encoding library and our regular expression library, and they're just Java libraries that you can use. That's what we're going to show today a little bit.

So here are the libraries we have. I won't talk in depth about ByteList; you can just imagine it's a string buffer for byte arrays, and that's pretty much all there is to it. It aggregates the same things a Ruby string does, an array of bytes, a byte length, and an encoding, and it provides operations over those to append and insert and whatnot. (There's a small usage sketch below.)

JCodings is the encoding subsystem, and it has all of the metadata for every supported encoding: things like, if I see this byte, how many additional bytes do I need for a multi-byte character? Or, given this code point, what bytes would this encoding represent it as? So, functional things for going back and forth between bytes and characters or code points. It also has a rather complicated inner loop: a transcoder that can take a byte array known to be in encoding A on one side and output a byte array in another encoding on the other side, without any intermediate step. To do that in Java, you would have to decode everything to UTF-16 and then re-encode it into the other encoding, wasting all that time in between, so in general the transcoder here can do those conversions significantly faster than Java can.

The regular expression engine is a port of the engine the CRuby folks adopted, Oniguruma. Oniguruma can work on arbitrary encodings, and it has pluggable syntax, so there are multiple versions of regular expression syntax available on top of any encoding. It's the most customizable regular expression engine out there, for sure. And our port of it actually has better performance than java.util.regex for doing matches. Plus, you don't have the cost of pulling bytes in as characters before you start matching: you do a read, do your matching, and send the result back out. No characters involved, no transcoding involved. It's pretty cool.
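Here's the ByteList usage sketch promised above. It's a hedged sketch: the constructor and method names reflect my reading of the org.jruby.util.ByteList API, so treat the exact signatures as assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.jcodings.specific.UTF8Encoding;
import org.jruby.util.ByteList;

public class ByteListSketch {
    public static void main(String[] args) {
        // A growable byte buffer plus an encoding tag: bytes, byte length, encoding.
        ByteList buf = new ByteList();
        buf.append("hello ".getBytes(StandardCharsets.UTF_8));
        buf.append("wörld".getBytes(StandardCharsets.UTF_8));
        buf.setEncoding(UTF8Encoding.INSTANCE);

        System.out.println(buf.length());   // byte length (12), not character count (11)
        byte[] raw = buf.bytes();           // copy of the underlying bytes
        System.out.println(raw.length);
    }
}
```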
Okay, so a little more detail about JCodings. Like I said, it's all the character data and metadata for the different encodings. Lots are supported: all the ones you would typically use, and several you will never want to use. It actually supports more encodings than the default set of charsets in OpenJDK; for at least one of them, I think ISO-8859-11, I have a patch, which I don't know if I've gotten in yet, to add the decoders and encoders to Java. So, very complete encoding support: all the rare Asian and European encodings that you don't see often outside those countries, and even all the IBM and Windows code pages are in there.

You can do the direct transcoding I talked about, taking bytes to bytes without any intermediate step, much faster than going through UTF-16. And that was an epic piece of code to port, if you ever want to see something interesting: the C code had nested switches and nested loops, with branches and gotos jumping out into other switches and cases. That was a fun week, porting that to Java.

There are bonus features that are kind of cool too. You can have it replace anything that's not a valid XML character with its entity representation as it encodes or decodes. You can also have it normalize line endings, writing CRLF on the way out and plain line feeds on the way in, with various levels you can configure.

And lots of folks are using this in the wild. Obviously JRuby uses it. The Facebook folks use it for some high-speed character IO where they need to process data and send it out quickly without paying the decoding and encoding cost every time. TruffleRuby uses it, since they grew out of JRuby, and JetBrains uses it internally for their Ruby-related tooling.

Here's a quick, simple example of some of the metadata APIs. We've got our UTF-8 bytes, seven bytes long for a five-character string. We can get the actual character count by asking the UTF-8 encoding to count it up. We can ask how wide an individual character is at a particular offset, so we know how many more bytes to read. And we can go back and forth between code points and bytes.

Then there's the transcoder. We open a new transcoder from UTF-8 to UTF-16. We've got our source and destination byte arrays; this is actually a test from the JCodings suite. Then we do our convert. For the source, we essentially pass the start position by pointer, and it lets us know how far it was able to decode. Same for the destination: we pass where to start writing, and since the output can be a different number of bytes depending on the target encoding, we need to get that position back out too. I'd love to have multiple return values here. The zero at the end is a flags argument for other behaviors, like changing how it handles invalid characters, whether it replaces them or reports errors, and so on. And then we do the conversion and get the bytes out, never passing anything through characters in the middle.
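The slide code isn't captured in the transcript, so here's a reconstruction of both examples as I understand the JCodings API (Encoding's strLength/length/mbcToCode/codeToMbc and the EConv transcoder); treat the exact signatures as assumptions rather than authoritative usage.

```java
import java.nio.charset.StandardCharsets;
import org.jcodings.Encoding;
import org.jcodings.Ptr;
import org.jcodings.specific.UTF8Encoding;
import org.jcodings.transcode.EConv;
import org.jcodings.transcode.TranscoderDB;

public class JCodingsSketch {
    public static void main(String[] args) {
        Encoding enc = UTF8Encoding.INSTANCE;
        byte[] bytes = "héllö".getBytes(StandardCharsets.UTF_8); // 7 bytes, 5 characters

        // Metadata APIs: character count, width of the character at an offset,
        // and code point <-> byte conversions.
        int chars = enc.strLength(bytes, 0, bytes.length);  // 5
        int width = enc.length(bytes, 1, bytes.length);     // 2: 'é' is two bytes
        int code  = enc.mbcToCode(bytes, 1, bytes.length);  // code point for 'é'
        byte[] buf = new byte[enc.maxLength()];
        int written = enc.codeToMbc(code, buf, 0);          // back to UTF-8 bytes
        System.out.println(chars + " " + width + " " + code + " " + written);

        // Direct transcoding, bytes to bytes, no UTF-16 in the middle.
        EConv conv = TranscoderDB.open("UTF-8".getBytes(), "UTF-16BE".getBytes(), 0);
        byte[] dst = new byte[bytes.length * 2];
        Ptr srcPos = new Ptr(0);                 // start positions, "passed by pointer"
        Ptr dstPos = new Ptr(0);
        conv.convert(bytes, srcPos, bytes.length, dst, dstPos, dst.length, 0);
        System.out.println("consumed " + srcPos.p + " bytes, wrote " + dstPos.p);
    }
}
```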
So, Oniguruma. Joni is our port of it to Java. It's been functional for years now; we have occasional updates and fixes, but it's been working for a long time, so it's fairly stable. It's a bytecode machine, and it's stackless, which is very important when we talk about some failings of java.util.regex. There are certain regular expression structures where the existing implementation will deepen the stack, and deepen the stack, and deepen the stack, so there are cases you cannot match with java.util.regex over very large input, because it will blow the stack. Joni doesn't have that problem, because it's stackless.

Like I say, it's highly configurable, with different grammars. There's a syntax for Java, for Ruby, for JavaScript, and for a couple of other languages; you can pick which syntax you're using for your regular expressions and just plug it in. And again, there are lots of users in the wild, including JRuby. Nashorn made a modified fork of it that works on character arrays, so it doesn't have the advantage of multiple-encoding support, but it's much faster, matches JavaScript's syntax, and so on.

Okay, so here are a couple of quick Joni examples. We have a regular expression that we can create with just a simple string pattern, or we can specify a different syntax to use with it. We get our matcher, just like in java.util.regex, and we do our search; the options provide various ways of altering how the matching is done. And one of the other nice features, not in java.util.regex: if you have a regular expression that hits a pathological case and runs forever, you can do a normal thread interrupt and kill that match, so you don't get stuck in some inner loop of a regular expression that's never going to return.

In the last part of the Joni examples, we get regions out for pulling match groups. We get a start index and an end index and go directly to the bytes: we don't have to turn anything into characters, we can go back into the same byte array with no copying and still have regular expression matches with groups.

Okay, a quick note on performance. Like I said, JCodings is definitely faster than going through UTF-16; it's hard to compare beyond that, because it doesn't match the way charsets work, but it's pretty good performance. Joni can be significantly faster than java.util.regex. There are certain cases where java.util.regex just blows up, but it's two to three times faster for most of the things I've tested. So if you want to pull bytes off the wire really fast and do quick matches against them, this is a great one to look at. And being interruptible is great: I've had plenty of regular expressions go off into the weeds and never seem to come back, and it's nice to be able to kill those off and investigate.
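Again, the slide code isn't in the transcript, so here's a hedged reconstruction of those Joni examples. The class and method names follow my reading of the org.joni API, and the Region accessors in particular vary between versions (older releases expose public beg/end arrays instead of getters).

```java
import java.nio.charset.StandardCharsets;
import org.jcodings.specific.UTF8Encoding;
import org.joni.Matcher;
import org.joni.Option;
import org.joni.Regex;
import org.joni.Region;
import org.joni.Syntax;

public class JoniSketch {
    public static void main(String[] args) {
        byte[] pattern = "h(.*)o".getBytes(StandardCharsets.UTF_8);

        // Compile against raw bytes with an explicit encoding and grammar.
        Regex regex = new Regex(pattern, 0, pattern.length,
                                Option.NONE, UTF8Encoding.INSTANCE, Syntax.RUBY);

        byte[] input = "say hello!".getBytes(StandardCharsets.UTF_8);
        Matcher matcher = regex.matcher(input);

        // Search the byte range; returns the match start offset, or -1 for no match.
        int at = matcher.search(0, input.length, Option.DEFAULT);
        if (at >= 0) {
            // Regions hold byte offsets for each group: go straight back into
            // the same byte array, no characters, no copying.
            Region region = matcher.getEagerRegion();
            int start = region.getBeg(1), end = region.getEnd(1);
            System.out.println(new String(input, start, end - start,
                                          StandardCharsets.UTF_8)); // "ell"
        }
    }
}
```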
Okay, so wrapping up. Java's String has been essentially unchanged for decades as far as the internal implementation goes, but it's starting to evolve some of these features, like the compact ASCII strings, and hopefully that process will continue and we'll get a more robust string that cuts some of that overhead out for us. Learning from what Ruby has done, and what we've ported and reimplemented in JRuby, may inform some of those features in the future. So hopefully we'll be able to help improve Java's String going forward. Meanwhile, these libraries are available under the JRuby organization on GitHub. They're all right there, ByteList, Joni, and JCodings, and they're all in Maven, so you can pull them down ready to go.

That's all I've got. Thank you. We have a little time for questions.

Q: The Java String methods, things like string indexing and string comparison, are heavily optimized with intrinsics that use vector instructions and so on. How do you compete with those operations in your classes?

A: That's a good question. I'm not sure where we stand on that. We have done some of the Unsafe tricks that folks have used, like 64-bit strides for string searches, and those help get us pretty close to where Java's performance usually is. We haven't done a whole lot of exploration beyond that, because, again, it's a weird comparison: we're working with bytes, generally with non-constant-access-time UTF-8 handling along the way, versus UTF-16 strings that can be read straight out. But I would imagine we're reasonably competitive. I'd like to run some more numbers, though.

Okay, thank you. All right, well, see you around. Thanks!