Hi everyone. Sorry? Okay, thank you. When it came time to decide what kind of talk to do this year at FOSDEM, I didn't have any good idea, but it happened that around that time there were various Gerrit patches that had issues with something that sounds as trivial as URL handling. And so I told these poor people who had the issues which functions exactly they had to use where, and what parameters exactly to pass in, so that it would work. This code is decades old by now; it has morphed through various stages, but it is still apparently in a shape where, whenever somebody wants to use it, it's difficult. For one, it's probably difficult because these issues are trivial on the surface but get difficult if you dig deeper; but also because we fucked this up somewhat over the years, or I fucked up much of it, so I feel privileged to talk about it. And so here we are.

A URL, or URI, is something rather trivial. I'll only talk about the textual manipulation of these things within LibreOffice. It's not the other issue of: you have a URI that points at some document somewhere on the internet and you want to get the content out of it. That's not what we are talking about here, only the even more trivial part of assembling these things, passing them around, reading them in.

So what's a URI? There's an RFC for that. It has some components: a scheme, an authority part with maybe user info, a path, maybe a query and a fragment. Much of that can be left out. There were various RFCs over time that formalized and improved this, and at some point during that formalization there was the concept of a "URI reference" that also included the fragment.
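The component structure just described can be split off mechanically; RFC 3986, Appendix B even gives a regular expression for it. A minimal sketch of that split, where `UriParts` and `splitUriReference` are invented names for illustration, not LibreOffice API:

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Split a URI reference into its five components, using the regular
// expression from RFC 3986, Appendix B.  Illustrative names only.
struct UriParts {
    std::optional<std::string> scheme, authority, query, fragment;
    std::string path;  // always present, possibly empty
};

UriParts splitUriReference(const std::string& uri) {
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    std::regex_match(uri, m, re);
    UriParts p;
    if (m[1].matched) p.scheme = m[2].str();
    if (m[3].matched) p.authority = m[4].str();  // may be present but empty
    p.path = m[5].str();
    if (m[6].matched) p.query = m[7].str();
    if (m[8].matched) p.fragment = m[9].str();
    return p;
}
```

Note the difference between an absent authority (`mailto:...`) and a present-but-empty one (`file:///...`); that distinction comes up again below.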
By now, the URI is the whole thing, including the fragment, but this jargon of calling the whole thing, including that optional fragment part, a "URI reference" still shines through in our UNO API. So I thought I'd clarify why we have these funny-looking UNO functions that I'll talk about later.

Now let's go back in time to the late 90s, last century. I had just started in a company called StarOffice, no, StarDivision actually, doing StarOffice development. We had our own thing to handle these URLs, INetURLObject. Of course, it didn't use all these wonderful standardized names; it had invented its own, completely different set of function names. So you have a GetParam for the query part, and GetMark gives you the fragment part. It doesn't even talk about schemes, but about protocols, and there was a fixed enum list of protocols that we knew; everything else we would say is not a URL, it's invalid. So if you had an HTTPS URL: we wouldn't have had something like that at that time.

Plus, it was horrible in that it insisted: "I know what the specification is." I mean, it didn't know what the specification is, else it wouldn't have called it GetParam but GetQuery. But on the other hand it said: "I do know that there are things you need to write in this way or in that way." So it always tried to normalize all your input into the one form that it believed was the right, canonical one. And back in the late 90s there were still some web servers out there that had their own quirks. For example, there might have been some web server that wanted a URL written a certain way, where it didn't want a literal exclamation point, but wanted that, for whatever weird reason, encoded with the percent encoding that is standard for URLs. An exclamation point is ASCII number 21 in hex, so it becomes %21 there.
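That percent encoding is mechanical: take the octet's value in hex and prefix it with `%`. A sketch of an encoder that escapes everything outside RFC 3986's unreserved set (`percentEncodeAll` is an invented name; a real encoder would keep reserved syntax characters like `/` intact depending on the component):

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <string>

// Percent-encode every character outside the RFC 3986 "unreserved" set
// (ALPHA / DIGIT / "-" / "." / "_" / "~").  Illustrative helper only,
// not the INetURLObject API.
std::string percentEncodeAll(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        bool unreserved =
            std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", c);  // e.g. '!' -> "%21"
            out += buf;
        }
    }
    return out;
}
```

With this, `percentEncodeAll("!")` yields `"%21"`, the spelling that quirky server wanted.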
But if you entered this into the hyperlink dialog in StarOffice back then, or LibreOffice today, what would happen is that internally we would rewrite that as the canonical URL, with a literal exclamation point, and that's what then goes over the wire to that HTTP server. By now most servers behave well, so this issue, which was quite a pain point back then, has more or less vanished. Our code is still as broken as it was back then, but the code outside, the web server code, has gotten much better over the years and decades, so that's no longer much of an issue.

Another funny issue is that, as I said, this URL object knew a fixed set of URL schemes, parsed them, and understood what exactly they would need to look like. Take file URLs, which are quite common: when you work with your local files, or pass your files to some other application, or want to invoke some Java functionality on your files, you pass around file URLs. And this INetURLObject still insists on file URLs having three slashes, because there needs to be this authority component, even if it is empty, or it could also be "localhost". But per the RFC, of course, you can leave it out. In our implementation, those two funny slashes still have to be spelled out. When Java creates a file URL, for example, it always uses the short form, and when you then passed that into LibreOffice, for example asking it on the command line to open it, it used to trip over that. I think by now we're clever enough to handle it; there were also other reasons why we need to map file URLs back and forth, I'll come to that later, and I think by now we handle this special case as well.

So one part of a URL is the syntax, like those two slashes for the authority part, even if the authority isn't there in a file URL. The other part is the kind of payload data that's included there, which again seems rather trivial. For example, there's the data URL format.
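The three-slash business can be sketched as a small normalizer that maps the short forms the file-URL RFC (RFC 8089) allows onto the form with an explicit empty authority. `normalizeFileUrl` is invented for illustration and glosses over Windows drive letters and real remote hosts:

```cpp
#include <cassert>
#include <string>

// Normalize "file:/p" and "file://localhost/p" to the three-slash form
// "file:///p" with an explicit empty authority.  Illustrative sketch only.
std::string normalizeFileUrl(const std::string& url) {
    const std::string scheme = "file:";
    if (url.compare(0, scheme.size(), scheme) != 0) return url;  // not a file URL
    std::string rest = url.substr(scheme.size());
    if (rest.compare(0, 2, "//") == 0) {
        // An authority is present; find where the path starts.
        std::string::size_type slash = rest.find('/', 2);
        std::string auth = rest.substr(
            2, slash == std::string::npos ? std::string::npos : slash - 2);
        std::string path = slash == std::string::npos ? "" : rest.substr(slash);
        if (auth.empty() || auth == "localhost")
            return "file://" + path;  // empty authority, then the path
    } else if (!rest.empty() && rest[0] == '/') {
        return "file://" + rest;  // "file:/p" -> "file:///p"
    }
    return url;  // real remote host or something odd: leave alone
}
```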
What you can encode with a data URL is pure data: bytes, octets in RFC speak. Each ASCII letter in the payload corresponds to a data octet, and the other way around. The comma in this data URL is part of the syntax, the meta level; then you have an 'a', a 'b', a 'c' and a space, and they correspond to those four bytes. The 'b' you can also write with this percent-encoding stuff, and the space you need to write as the escape %20, because spaces are not allowed in URLs. So this is easy: you have some stuff and you decode it into bytes.

For a file URL, the payload is of course a file path, a path name on your system. The simple case is again quite simple: /foo/bar in the URL is /foo/bar as a path name on Linux. What is less clear is, if you have some non-ASCII characters in your path name, how do you construct the URL for that? Back in the days we are talking about, that wasn't clear at all. By now, thanks again to the outside world, everybody has moved to UTF-8, so it is quite clear that you want to encode an umlaut as the corresponding UTF-8 bytes and then percent-encode those into the URL. Back then there were still, especially on Windows, lots of different character encodings in use. By now Windows has moved to 16-bit path names, so this is even easier. But back then it was tricky how to do this conversion; sometimes it would be done right, and sometimes it would be done wrong, or it just wouldn't work.

And what made this even more problematic: I think we are still in the late 90s. StarOffice wanted to go into this brave new world of Unicode. I mean, we were a text processing application first and foremost, and there was this new Unicode thing, and it needed to be embraced on all levels. Not only typed text but the whole UI, everything: suddenly Unicode was the thing to go for. Before that it had been easy. We had one string class, of course our own one, hand-written of course.
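Getting the payload octets back out of such a URL is a single decoding pass over the percent escapes. A sketch (`percentDecodeToBytes` is an invented helper, not LibreOffice code):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Decode a percent-encoded payload into raw octets, the way the data-URL
// example above works: each plain character or %XX escape is one octet.
// Illustrative sketch only.
std::vector<unsigned char> percentDecodeToBytes(const std::string& in) {
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;  // not a hex digit
    };
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()
            && hex(in[i + 1]) >= 0 && hex(in[i + 2]) >= 0) {
            out.push_back(
                static_cast<unsigned char>(hex(in[i + 1]) * 16 + hex(in[i + 2])));
            i += 2;  // consumed the two hex digits
        } else {
            out.push_back(static_cast<unsigned char>(in[i]));
        }
    }
    return out;
}
```

So the payload `a%42c%20` comes out as the four octets for 'a', 'b' ('B' here, to show the escape), 'c' and a space. Note the result is deliberately a byte vector, not a string; what string to make of those bytes is exactly the question the next part is about.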
And URLs were written as strings, and if you took something out or combined some strings into a URL, everything was easy and dandy. Then we had two string classes: one for the old code that had not yet been transformed into the new world, and another one for the new code, a UTF-16, Java-inspired Unicode string class. And because we were transitioning the code, there were URLs in both forms, using old 8-bit and new 16-bit strings.

Now the problem starts if you want to get a 16-bit Unicode string out of this, if you want to get the payload, the file path, the path name, out of this thing. It's not only that this is maybe a 16-bit string that you pass into the URL object; you also need to think about the individual bytes represented by it. You have bytes in the payload, like in the data URL I showed earlier, so this is just bytes, but what you want to get out is this Unicode string that works on 16-bit units. How do you do that? There are different ways, and again, depending on your needs, sometimes the one way is the right one, sometimes the other. So depending on how you do it, you can get out some rather different strings, and that's what started to show up back then. The easy cases of this transition to the wonderful Unicode world were easy, because if you had your foobar.txt, that kept working by magic, because there are no strange characters in there. But in other cases it started to get messy.

The other messy thing is that by then I had become the maintainer of that dreaded class. And what made it even more messy: back in the late 90s, the way we built StarOffice was not like today, where you do a change on LibreOffice, you grep over the source code, you change things all over the place, you do a make check, and half an hour later you're done. No. Back then, if you wanted to do a change, you had to write a mail.
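The ambiguity can be made concrete: given the payload octets, here are the two ways to build a 16-bit string from them. They agree on plain ASCII and disagree on the UTF-8 bytes of an umlaut. Both helpers are invented for illustration (a real UTF-8 decoder also handles longer sequences and malformed input):

```cpp
#include <cassert>
#include <string>

// Way 1: widen each byte to one UTF-16 unit (treat the payload as Latin-1).
std::u16string bytesAsLatin1(const std::string& bytes) {
    std::u16string out;
    for (unsigned char c : bytes) out += static_cast<char16_t>(c);
    return out;
}

// Way 2: interpret the bytes as UTF-8 (one- and two-byte sequences only,
// for brevity of the sketch).
std::u16string bytesAsUtf8(const std::string& bytes) {
    std::u16string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char c = bytes[i];
        if (c < 0x80) {
            out += static_cast<char16_t>(c);
        } else if ((c & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            unsigned char c2 = bytes[++i];
            out += static_cast<char16_t>(((c & 0x1F) << 6) | (c2 & 0x3F));
        }
    }
    return out;
}
```

For `"foo"` both produce the same string; for the two octets 0xC3 0xA4 (UTF-8 for the umlaut U+00E4) way 1 gives the two characters "Ã¤" and way 2 gives the single character "ä". That is exactly the "rather different strings" situation above.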
"Dear colleagues," and it was called a "change that must be done", an important change that everybody must do, otherwise the whole thing wouldn't compile on the next build. The next build was not done by you on your desktop machine, but by a special group of engineers who built the thing. And if somebody had either not declared that something needed to be changed everywhere, or somebody else had not done that change, then everything would break, of course. So you wouldn't do these changes yourself all over the place. I would never have been allowed into the Writer code to do these changes there; the Writer people would have said: "What, you want to do a change there? But that's work for us, and that's lots of work, we can't do that, so forget it." But I needed to get this change into the code in some way so that nobody would notice and nobody would be too angry with me for all the work I imposed on them.

So what did I do? To all these functions in this dreaded URL object class, I added some more parameters, but I added them as defaulted parameters, so that nobody would notice. I had to be creative with what the parameters did, because, as I showed on the previous slide, whether you want the one or the other output for the foobar.txt example without any Unicode in it, it all works fine either way; but when you come to the fine points, it's very important which one you choose. So I had to choose a default for each of these functions, and of course I guessed wrong in most of the cases. But initially nobody noticed, because for the easy cases like foobar.txt everything continued to work. So I saved my ass with that and only shoveled the hard parts under the rug, and that's where we still are today. When today somebody has a Gerrit patch that doesn't quite work, and I need to tell them exactly which parameters to stuff in where, that's because 20 years ago I shoveled that under the rug. So what are these wonderful strange parameters?
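The defaulted-parameter trick itself looks roughly like this: a new parameter with a default keeps every old call site compiling unchanged, while the default silently picks one of the behaviours. All names here are invented, not the real INetURLObject signatures:

```cpp
#include <cassert>
#include <string>

// Invented sketch of adding a defaulted behaviour switch to an existing
// accessor without touching any of its call sites.
enum class Decode { None, Unescape };

std::string getPayload(const std::string& payload, Decode d = Decode::None) {
    if (d == Decode::None) return payload;  // old behaviour, the chosen default
    // New behaviour: resolve %XX escapes.
    std::string out;
    for (std::size_t i = 0; i < payload.size(); ++i) {
        if (payload[i] == '%' && i + 2 < payload.size()) {
            out += static_cast<char>(
                std::stoi(payload.substr(i + 1, 2), nullptr, 16));
            i += 2;
        } else {
            out += payload[i];
        }
    }
    return out;
}
```

Every old call `getPayload(s)` still compiles and, for escape-free names like `foobar.txt`, still gives the same answer with either setting; only the hard cases reveal whether the default was the right guess.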
Yes. Whenever you want to get some text into or out of a URL object, you need to decide what these bytes actually are, or how you want to represent them. For example, if you have a complete URL, there is a function GetMainURL that gives you the complete URL back, canonicalized of course, because that's what this thing does. Say there is this strange file with "10%20" in its name, whatever that is supposed to mean; in the URL it is encoded as "10%2520", because the percent sign itself needs to be encoded as %25. When you want to get this URL back, you still want that syntax there and not have it decoded. So GetMainURL takes a decode mechanism, this is the thing that I defaulted to something, and there are four different ways to get it back: either decode nothing; or decode to an internationalized URI, that is, with umlauts intact; or this WithCharset one that decodes everything, despite the name being somewhat vague, it really blows everything up in this case; and then there is Unambiguous, which came later, only because of too many problems.

If you use this blow-everything-up WithCharset thing, which would have been the default, then everything gets decoded. So the output would be: the %25, which is an encoding of a percent sign, becomes a percent, then comes the 2, then comes the 0. This still looks like a file URL, but it's a file URL for something rather different, because now it names the file "10 ", not "10%20": if you interpret this whole decoded thing as a file URL, the %20 in it means a space character. That's the situation where we got stuck.

And then we also created some very inventive uses of URLs as well. One of them, I won't go into it, is this macro-expanding stuff that generates URLs from other URLs. And we have this
package thing. Because now ODF came onto the scene, and we have these zip files with sub-content: we need to address files inside the zip, but we also need to describe the original zip file itself. So we have these package URLs that include, as their authority, the whole inner URL, encoded: you have a file URL, and all the slashes in it need to be encoded, and all the percent signs in it need to be encoded. And if you want to take that out again, you have to be careful to take it out in the right way, with the right decode mechanism parameter, or else everything blows up. Which is the typical case of problems where we still get these Gerrit patches that don't quite work.

So what did we do? At that time we all knew this URL object was just not going to work for that, so we invented some UNO stuff. Now we are into the early 2000s; the UNO bubble is at its height. So I rewrote, or started to rewrite, all this stuff in UNO, which was also a bad idea in retrospect, because UNO is the thing where, once you have written or defined something, you cannot change it anymore, because of backwards compatibility with extensions. So I was very careful not to add too much nonsense that we would have to carry on forever. What we do have there by now is rather little, or close to nothing, with very weird names, because that was the time when this "URI reference" thing was in vogue in the RFC world, so I eagerly copied that from there. Now we have some very funny-looking stuff with very long names that can do this conversion, but again only if you know exactly how to write your 20 lines of code. So people don't do that either, and still ask me in their Gerrit patches for help with that. And we can't change it, because UNO can't be changed, so that wasn't that bright of an idea either.

So here we are, 20 years later, still having fun with this. Thank you, and I hope there are no actual questions or Gerrit patches at this time. You're still responsible? No, sorry. Thank you.
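For reference, the package-URL construction described above, the whole inner URL percent-encoded into the authority, can be sketched like this. `makePkgUrl` is an invented helper; in LibreOffice the real work happens behind the UNO URI services:

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <string>

// Build a package URL whose authority is the whole inner URL,
// percent-encoded so that its slashes and percent signs survive as data.
// Illustrative sketch only.
std::string makePkgUrl(const std::string& innerUrl, const std::string& pathInZip) {
    std::string auth;
    for (unsigned char c : innerUrl) {
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            auth += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", c);  // ':' -> "%3A", '/' -> "%2F"
            auth += buf;
        }
    }
    return "vnd.sun.star.pkg://" + auth + pathInZip;
}
```

Getting the inner URL back out is the reverse: decode the authority exactly once. Decode the whole thing one time too many, or with the wrong mechanism, and the slashes reappear in the authority and the URL falls apart, which is exactly the blow-up described above.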