Yeah, when it came time to decide what kind of talk to do this year at FOSDEM, I didn't have any good idea. But it happened that around that time there were various Gerrit patches that had issues with something that sounds as trivial as URL handling. And so I told these poor people who had the issues which functions exactly they had to use where, and which parameters exactly to pass in, so that it would work. And yeah, this is something that is decades old by now. The code has morphed through various stages, but is apparently still in a shape such that whenever somebody wants to use it, it's difficult. Probably difficult for one because these issues are rather trivial on the surface but get difficult if you dig deeper, but also because we fucked this up somewhat over the years. I fucked up much of it myself, so I feel privileged to talk about it. And so here we are. A URL or URI, something rather trivial. I'll only talk about the textual manipulation of these things within LibreOffice, not the other issue of: you have a URI that points at some document somewhere on the internet and you want to get the content out of it. That's not what we are talking about here; we're talking about the even more trivial part of assembling these, passing them around, reading them in. So what's a URI? There's an RFC for that. It has some components: a scheme, an authority part with maybe user info, a path. Much of that can be left out. You may have a query and a fragment. There were various RFCs over the time that formalized and improved this, and at some point during that formalization there was the concept of a "URI reference" that also included the fragment. By now the URI is the whole thing, including the fragment, but this jargon of calling the whole thing, including the optional fragment part, a "URI reference" still shines through in our API.
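To make that component structure concrete, here is a small sketch (illustrative only, not LibreOffice's own code) using Python's standard `urllib.parse`, which splits a URI reference into the generic parts the RFC names:

```python
from urllib.parse import urlsplit

# Split a URI reference into the generic components:
# scheme, authority (which may carry user info), path, query, fragment.
uri = "https://user@example.com:8080/docs/report.odt?rev=3#section2"
parts = urlsplit(uri)

print(parts.scheme)    # https
print(parts.netloc)    # user@example.com:8080  (the authority part)
print(parts.path)      # /docs/report.odt
print(parts.query)     # rev=3
print(parts.fragment)  # section2
```

The fragment is part of what the split handles, which matches the modern reading where the "URI reference" includes the optional fragment.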
So I thought I'd clarify why we have these funny-looking UNO functions that I'll talk about later. Now let's go back in time to the late 90s, last century. I had just started in a company called StarDivision, actually doing StarDivision development. We had our own thing to handle these URLs, INetURLObject. Of course it didn't use all these wonderful standardized names; it had just invented its own completely different set of function names. So you have a GetParam for the query part, GetMark gives you the fragment part. It doesn't even talk about schemes, but about protocols. And there was a fixed enum list of protocols that we knew, and for everything else we would say: that's not a URL, that's invalid. So if you had an HTTPS URL back then, it wouldn't have accepted something like that. Plus it was horrible in that it insisted: I know what the specification is. I mean, it didn't know what the specification is, else it wouldn't have called it GetParam but GetQuery. But on the other hand it said: I do know that there are things you need to write in this way or in that way. So it always tried to normalize all your input into the one form that it believed is the right one, the canonical one. And back then, in the late 90s, there were still some web servers out there that had their own quirks. For example, there might have been some web server that wanted a URL written in a particular way, where it didn't want a literal exclamation point, but wanted that, for whatever weird reason, encoded with this percent-encoding that is standard for URLs. An exclamation point is ASCII number 21 in hex, so that's %21 there. But if you entered this into the hyperlink dialog in StarOffice back then, or LibreOffice today, what would happen is that internally we would rewrite that as the canonical URL with a literal exclamation point, and that's what then went over the wire to that HTTP server.
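The normalization problem can be shown in a couple of lines. This is only an illustrative sketch in Python, not the INetURLObject code: `%21` and a literal `!` denote the same octet, so a normalizer that prefers the unencoded form rewrites one into the other, and a quirky server that insists on `%21` never gets to see it:

```python
from urllib.parse import quote, unquote

# '!' is ASCII 0x21, so "%21" and a literal "!" denote the same character.
encoded = "http://example.com/hello%21"     # what the user typed
canonical = unquote(encoded)                # what a normalizer sends instead
print(canonical)                            # http://example.com/hello!

# Going the other way, forcing '!' into its escaped form gives %21 again:
print(quote("!", safe=""))                  # %21
```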
By now most servers behave well, so this issue, which was quite a pain point back then, has more or less vanished. Our code is still as broken as it was back then, but the code out there, the web server code, has gotten much better over the years or decades. So that's no longer that much of an issue. Another funny issue is that, as I said, this URL object knew a fixed set of URL schemes, parsed these, and understood what exactly they would need to look like. Take file URLs, which are quite common: when you work with your local files, or pass your files to some other application, or want to invoke some Java functionality on your files, you pass around file URLs. And this INetURLObject still insists on file URLs having three slashes, because there needs to be this authority component, even if it is empty (it could also be "localhost"), but per the RFC you can of course leave it out. In our implementation it still has to be spelled out with these two funny slashes. When Java creates a file URL, for example, it leaves them out, and when you then passed that into LibreOffice, for example asked it on the command line to open that, it used to choke. I think by now we're clever enough; there were also other reasons why we need to map file URLs back and forth, I'll come to that later, and I think by now we handle this special case as well. So one part of a URL is the syntax, like these two slashes for the authority part even if it is not there in the file URL. The other part is the kind of payload data that's included there, which again seems rather trivial. For example, there's a data URL format, and what you encode with a data URL is pure data: bytes, octets in RFC speak. So each data item corresponds to an ASCII letter, or the other way around, each ASCII letter corresponds to some data item. So you have four data items here in this data URL; the comma is syntax, meta-syntax.
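The three-slash point can be sketched like this (illustrative, not the LibreOffice mapping code): the canonical form with an empty authority has three slashes, while the single-slash form that Java's `File.toURI()` tends to produce names the same file but was rejected by the old object:

```python
from pathlib import PurePosixPath

# Canonical form with an empty authority: "file://" + "/home/user/doc.odt".
url = PurePosixPath("/home/user/doc.odt").as_uri()
print(url)   # file:///home/user/doc.odt

# The authority-less form names the same file per the RFC,
# but the old INetURLObject insisted on the three-slash spelling:
java_style = "file:/home/user/doc.odt"
print(java_style)
```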
So you have an 'a', a 'b', a 'c' and a space, and they correspond to these four bytes. The 'b' you can also write with this percent-encoding stuff, and the space you need to write as %20, because spaces are not allowed in URLs. So this is easy: you have some stuff and you decode it into bytes. For a file URL, the payload is of course a file path, a path name on your system. And the simple case is again quite simple: a /foo/bar URL path is a /foo/bar path name on Linux. What is less clear is, if you have some non-ASCII characters in your path name, how do you construct the URL for that? Back in those days we're talking about, that wasn't clear at all. By now, thanks again to the outside world, everybody moved to UTF-8, so by now it is quite clear that you want to encode this payload as the corresponding UTF-8 bytes and then percent-encode those into the URL. Back then, especially on Windows, lots of different character encodings were still in use. By now Windows moved to 16-bit path names, so this is even easier. But back then it was tricky how to do this conversion, and sometimes it would be done right and sometimes it would be done wrong, or it just wouldn't work. And what made this even more problematic, I think we are still in the late 90s here: StarOffice wanted to go into this brave new world of Unicode. I mean, we were a text processing application first and foremost, and there was this new Unicode thing, and that needed to be embraced on all levels. Not only typed text but the whole UI, everything: suddenly Unicode was the thing to go for. Before that it had been easy. We had one string class, of course our own handwritten one. URLs were written as strings, and if you took something out or combined some strings into a URL, everything was easy and dandy.
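Both payload rules can be sketched in a few lines (again illustrative, not LibreOffice's implementation; the file name below is made up): percent-decoding a data-URL payload yields raw octets, and a non-ASCII path name is first turned into UTF-8 octets and then percent-encoded:

```python
from urllib.parse import quote, unquote_to_bytes

# Data-URL payload: each literal ASCII character is one octet,
# each %XX escape is one octet. "abc%20" is the four octets a, b, c, space.
payload = "abc%20"
print(unquote_to_bytes(payload))   # b'abc '

# Non-ASCII path name: encode to UTF-8 octets, then percent-encode those.
path = "/home/user/Prüfung.odt"
url_path = quote(path.encode("utf-8"))
print("file://" + url_path)        # file:///home/user/Pr%C3%BCfung.odt
```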
Then we had two string classes: one for the old code that had not yet been transformed into the new world, and another one for the new code, a UTF-16, Java-inspired UniString class. And because we were transitioning the code, there were also cases of URLs using either the old 8-bit or the new 16-bit strings, and now the problem starts. Say you want to get a 16-bit Unicode string out of this; you want to get the payload out, the file path, the path name. It's not only interesting that what you pass into the URL object is maybe a 16-bit string, but also what the individual bytes represented by it mean. You have bytes in the payload, like in the data URL I showed earlier, so this is just bytes, but what you want out of this is a UniString that works on 16-bit units. So how do you do that? There are different ways, and depending on your needs, sometimes the one way is the right one, sometimes the other. So what might happen here is that, depending on how you do it, you get out some rather different strings, and that's what started to show up back then. The easy cases of this transition to the wonderful Unicode world were easy, because if you had your foobar.txt, then that kept working by magic, because there are no strange characters in there. But in other cases it started to get messy. And the other messy thing is that by then I had become the maintainer of that dreaded class, and what made it even messier is that back then, in the late 90s, the way we built StarOffice was not like today, where you do a change on LibreOffice, you grep over the source code, you change things all over the place, you do a make check, and half an hour later you're done. No. Back then, if you wanted to do a change, you had to write a mail.
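The ambiguity described here can be reproduced directly. This is a sketch in Python rather than the old string classes: the same payload octets give rather different Unicode strings depending on which character encoding you assume for them:

```python
from urllib.parse import unquote_to_bytes

# The payload is just octets; turning them into a Unicode string
# requires picking a character encoding, and different picks disagree.
octets = unquote_to_bytes("b%C3%A4r")     # four octets: b, 0xC3, 0xA4, r

as_utf8   = octets.decode("utf-8")        # 'bär'  -> 3 characters
as_latin1 = octets.decode("iso-8859-1")   # 'bÃ¤r' -> 4 characters
print(as_utf8, as_latin1)
```

Only one of the two is the path name the user meant, and nothing in the URL itself tells you which.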
Dear colleagues: here is what was called a "change that must be done", an important change that everybody must do, otherwise the whole thing wouldn't compile on the next compile. The next compile was not done by you on your desktop machine, but by a special group of engineers that built the thing. And if somebody had either not declared that something needed to be changed everywhere, or somebody else had not done that change, then everything would break, of course. So you wouldn't do these changes yourself all over the place; I would never have been allowed into the Writer code to do these changes there. The Writer people would have said: what, you want to do a change there? But that's work for us, and that's lots of work, we can't do that, so forget it. But I needed to get this change into the code some way, so that nobody would notice and nobody would be too angry with me for all the work I imposed on them. So what did I do? To all these functions in this dreaded URL object class I added some more parameters, but I added them as defaulted parameters so that nobody would notice. I had to be creative with what the defaults did, because as I showed on the previous slide, for the foobar.txt example without any Unicode in it, it all works fine whichever output you choose, but when you come to the fine points, it's very important which one you choose. So I had to pick one default for each of these functions, and of course I guessed wrong for most of the cases. But initially nobody noticed, because for the easy cases like foobar.txt everything continued to work. So I saved my ass with that and only shoveled the hard parts under the rug. And that's where we still are today: when today somebody has a Gerrit patch that doesn't quite work and I need to tell them exactly which parameters to stuff in where, that's because 20 years ago I shoveled that under the rug. So what are these wonderful strange parameters?
Yes, so whenever you want to get some text into or out of a URL object, you need to decide what these bytes actually mean, or how you want to represent them. For example, there is a function GetMainURL that gives you back the complete URL, canonicalized of course, because that's what this thing does. If there's a strange file with a "10%20$" file name, whatever that should mean, then it is encoded in the URL as "10%2520$", because the percent sign itself needs to be encoded as %25. When you want to get this URL back, you still want that syntax there and not have it decoded. So this function GetMainURL takes a decode mechanism parameter. This is the thing that I defaulted to something, and there are four different ways to get the thing back: either decode nothing (NONE); or decode to an internationalized URI (ToIUri), that is with umlauts intact; then there's this WithCharset that decodes everything, despite the name being somewhat vague, it really decodes it all; and then there's Unambiguous, which came later, only because of too many problems. If you use this decode-everything WithCharset thing, which would have been the default, then everything would have been decoded. So this would have been the output: %25 is an encoding of a percent sign, so that part decodes to a percent, then comes the 20, then comes the dollar, giving "10%20$". This still looks like a file URL, but it's a file URL for something rather different, because that's now the file "10 $", not "10%20$": if you interpret this whole thing as a file URL again, then this %20 means a space character. So that's the situation where we got stuck. And then, what also happened is that we created some very inventive uses of URLs as well.
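The trap with the decode-everything mechanism is essentially a double-decoding bug. Here is a sketch of it in Python (only illustrative; the actual mechanism lives in INetURLObject's decode-mechanism parameter):

```python
from urllib.parse import unquote

# A file literally named "10%20$" (with a percent sign in the name):
url = "file:///10%2520$"              # the %25 escapes the literal percent

decoded_once  = unquote(url)          # file:///10%20$  -> the real name
decoded_twice = unquote(decoded_once) # file:///10 $    -> a different file!
print(decoded_once)
print(decoded_twice)
```

Decoding once recovers the name; feeding the fully decoded string back into anything that treats it as a URL decodes it a second time and silently points at the wrong file.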
One of them, I won't go into that one, was this macro-expanding stuff that generates URLs from other URLs by looking things up. And we have this package thing, because now ODT, ODF, came onto the scene, and we have these files with sub-content, files inside the zip, but we also need to describe the original zip file itself. So we have these package URLs that include, as their authority, the whole inner URL, encoded. You have a file URL, and all the slashes need to be encoded, and all the percent signs in there need to be encoded. And if you want to take that out again, you have to be careful to take it out in the right way, with the right decode mechanism parameter, or else everything blows up. Which is the typical case of problems where we still get these Gerrit patches that don't quite work. So what did we do? At this time, at the latest, we all knew this INetURLObject is just not going to work for that, so we invented some UNO stuff. We're now into the early 2000s; the UNO hype was at its height. So I rewrote, or started to rewrite, all this stuff in UNO, which was also a bad idea in retrospect, because UNO is the thing where once you have written something, once you have defined something, you cannot change it anymore, because of backwards compatibility with extensions. So I was very careful not to add too much nonsense that we would carry on forever. What we do have there by now is rather little, or close to nothing, with very weird names, because at that time this URI-reference jargon was in vogue in the RFC world, so I copied that from there eagerly. Now we have some very funny-looking stuff with very long names that can do this conversion, but again only if you know exactly how to write your 20 lines of code. So people don't do that either; they still ask me in their Gerrit patches how to do it. And we can't change it, because UNO can't be changed. So yeah.
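The package-URL nesting can be sketched like this (the scheme name vnd.sun.star.pkg is the real LibreOffice one; the helper code is only illustrative): the inner URL is percent-encoded wholesale and placed into the authority of the outer URL, so on the way out it must be decoded exactly once, no more, no less:

```python
from urllib.parse import quote, unquote

# Embed a complete file URL, percent-encoded (slashes, colons and any
# percent signs included), as the authority of a package URL, with the
# in-zip path appended after it.
inner = "file:///home/user/doc.odt"
outer = "vnd.sun.star.pkg://" + quote(inner, safe="") + "/content.xml"
print(outer)
# vnd.sun.star.pkg://file%3A%2F%2F%2Fhome%2Fuser%2Fdoc.odt/content.xml

# Getting the inner URL back: isolate the authority, decode exactly once.
authority = outer[len("vnd.sun.star.pkg://"):].split("/", 1)[0]
recovered = unquote(authority)
print(recovered)                  # file:///home/user/doc.odt
```

Decode zero times and you hand a percent-soup string to the file system; decode twice and, as with the earlier example, any %XX inside the inner URL gets mangled.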
It wasn't that bright of an idea either. So here we are, 20 years later, still having fun with this. Thank you. And I hope there are no actual questions or Gerrit patches at this time. No? Sorry. Thank you.