I'm Michael Chirico. I work at Google as a data scientist. I've had an amateur interest in linguistics for a long time, which is how I got involved in learning a lot about translations and how they work. And Michael Lawrence is from R-core; he's in charge of maintaining translations for R-core. I don't know if you want to add anything else. So we'll talk about the state of things for translations in R as of today. Then we'll talk about the mechanics of how translations work: the system from gettext, how that works, and how it interfaces with R. Then we'll cover some common pitfalls in actually doing translations, common stumbling blocks in practice. And then we'll break out and start doing actual translations. So R has been supporting translations since about R 2.1, which is about 16 years ago, when Brian Ripley added integration with gettext to start doing translations for R. The first release had Japanese, Italian, and Simplified Chinese, and as recently as 2021 Lithuanian was added. As of now there are about 16 languages in various degrees of maintenance. I could read them off, but I think you get the idea. So R has a lot of messages; probably the vast majority of them you will never see, or hopefully nobody ever sees. Any error, message, verbose message, or warning produced by R itself, in its C code, in stats, in utils, in any of the default packages, is eligible for translation. There are as of today about 5,500 of those, and in the beginning there were about 2,000, so it has grown by about 2x. In base alone there are about 2,500 messages: roughly 1,900 C error messages and 600 from the R side. And stats has another 1,000 errors, things you did wrong in running regressions. So, quite a substantial message base. I got started on all this doing translations for data.table.
And data.table is another pretty old, pretty large package; data.table has about 1,400 messages, and R has about four times more than that. The data.table translation we did with a team of about 20 people, and it took about a month or two to get done. So it's a pretty substantial undertaking to do a full translation set for all of R. So, how do translations work? I hope I'm not dating myself too much with this Mario reference. Okay: you're playing Mario, you get to Bowser, you throw him into the lava, and then — the famous message: "Thank you, Mario! But our princess is in another castle!" If we were writing the code for Mario in R, how would we produce that? Hopefully it wouldn't be an error message, but you might write stop("Thank you, Mario! But our princess is in another castle!"), and then more code to send you on to the next level to try and find the princess again. So the two key components here are the messaging function, which is stop, and the message itself, which is the string "Thank you, Mario! But our princess is in another castle!" There's a small set of functions recognized by R as producing messages for translation. The three you'll use almost all the time are stop, warning, and message. For those three, by default, any string that appears in them literally is eligible to be extracted as a message for translation. And what stop is actually doing is delegating to this workhorse, gettext. gettext is really the workhorse behind all the translations: stop under the hood calls gettext, warning under the hood calls gettext, message under the hood calls gettext. Basically anything that is doing a translation — taking a string and presenting it to the user in another language — will be calling the R function gettext at some point, if it's happening on the R side; on the C side there are gettext interfaces that look basically the same.
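As a rough sketch (the surrounding function here is made up for the Mario example), the three messaging functions, and the gettext() call they all reduce to, look like this:

```r
# Hypothetical Mario-flavored function: each literal string below is
# eligible for extraction into the translation template.
next_castle <- function() {
  message("Flip over any two cards and see if they match.")  # informational
  warning("Be careful!")                                     # non-fatal
  stop("Thank you, Mario! But our princess is in another castle!")
}

# All three route their strings through gettext(); calling it directly
# returns the translation for the current language, or the input string
# unchanged when no translation is found:
gettext("Thank you, Mario! But our princess is in another castle!")
```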
Okay, so what is gettext doing? gettext is doing two things. First is figuring out what the domain is. You can think of the domain basically as being the language; it's a slightly more technical term than that, but for all intents right now you can just think of it as the language. So, the current R session: is it running in English? Is it running in Arabic? Is it running in Bahasa? Is it running in Japanese? What is the current domain in the current R session? And where is this message being called from? If it's being called from a package's namespace, it will look up the translations for that package. What do we mean by look up? Well, after we figure out which package we're in, which language we need to translate to, and which domain we're looking at, we find the corresponding .mo file. We'll talk about what .mo is in a second, but basically in this .mo file there are correspondences between a message and its translation, and there may be a bunch of messages in there. I wrote another example message here, "Flip over any two cards and see if they match" — another thing from Mario, if you remember. So within this .mo file there is basically a bunch of messages and corresponding translations, and it's essentially a data structure that makes it easy and fast to go from the input message ID to the output — what they call the message string, or msgstr — which is the translation. So, okay, we looked up this message ID and found the corresponding translation. Please don't hold me to this Japanese translation. I was actually looking for the official Japanese translation of this Mario message, but apparently the game is so old that in Japan the message was left in English: rendering these characters on a computer that old was too difficult, and enough people in Japan understand English that they just left it. So it's not actually an official translation of this into Japanese.
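One way to see this lookup in action (hedged: this assumes the relevant translation catalogs are installed, and the exact translated wording will vary):

```r
# Switch the language gettext consults via the LANGUAGE environment
# variable, then trigger a translatable base-R warning:
Sys.setenv(LANGUAGE = "ja")
tryCatch(log(-1), warning = function(w) conditionMessage(w))
# with a Japanese .mo file installed, the "NaNs produced" warning text
# comes back translated; otherwise the English source string is returned

Sys.setenv(LANGUAGE = "en")  # back to the untranslated source strings
```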
So this .mo file is how things work at runtime: we find the .mo file, look up the message, get the translation. So where do these .mo files come from? We can work backwards from writing the code to getting the .mo file. Within your source code, you'll have these messaging functions I mentioned: stop, warning, message. The other two I haven't talked about yet are gettextf, which is a way of doing templated translations — it's basically like sprintf, but built for translation; in fact, under the hood, gettextf is really just running sprintf on the output of gettext — and ngettext, which is used for plural translations; we'll talk about that in a second. Then we as developers do an extraction, which is to extract all of these strings from our source code and put them into a template file, this .pot file. So we would extract "a mistake", and it would come out as a message ID. We would extract "be careful", and it would come out as its own message ID. We would extract "%d observations", and it would come out as its own message ID. Plurals have their own slightly different syntax: you get a message ID for the singular case and a message ID for the plural case, and the output has corresponding [0] and [1] entries, which for English, for example, correspond to the singular and plural translations respectively. In the .pot template, the msgstr is always empty, because this is just a template; it doesn't have an actual translation in it. So there is no translation in the template file. The next step is to make translations: we start from a .pot file and put translations into a corresponding .po file. I made a crude attempt at a Portuguese translation here — I think we have some actual Portuguese speakers here, so you can call me out if I'm totally wrong.
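In .pot syntax, the extracted entries just described would look roughly like this (illustrative strings, not copied from R's actual template):

```
msgid "a mistake"
msgstr ""

msgid "be careful"
msgstr ""

# plural entries extracted from an ngettext() call get a singular and a
# plural msgid, plus numbered, still-empty msgstr slots:
msgid "%d observation"
msgid_plural "%d observations"
msgstr[0] ""
msgstr[1] ""
```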
But we start from this template and write translations into the .po file. So where in the .pot file there is no translation, here there is a translation; each message gets its own translation. Portuguese is a very simple case, because Portuguese linguistically does the same thing for pluralization as English: there are only two cases. There's the singular case, which corresponds to this first message, and the plural case, which is anything where the input isn't one — so even zero you would write like this. And the last step is using the gettext tooling to compile this .po file into that .mo file, and to put it in the right place so that R can find it at runtime. A .po file is a text file you can open in any text editor — RStudio, Atom, Sublime, Emacs, anything; it's just a text file. The .mo file is a compiled thing, so if you open it, it's mostly gibberish, the messages included; it's basically a data structure, the compiled version. That's a lot, but that is the basic essential mechanics of how translations work, so I will pause and take questions for a minute or two. Is this the same for packages as well, or not? Yes — base R itself uses the same system, gettext, as any package you would add on. The main difference for R itself is that the base package has slightly funny mechanics for where the files you get translations from are found. For packages, all of the R messages come from the R directory and all of the C and C++ messages come from the src directory; but for base itself, the directory structure of R's own sources is quite different. I don't know how many people are that familiar with how R's sources are structured, but that's not that important — the point is that there are C files that are not just in the src directory under base.
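Filled in for Portuguese, the same plural entry in the .po file might look like this (my Portuguese here is approximate); the Plural-Forms header line encodes the "anything but one is plural" rule:

```
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "%d observation"
msgid_plural "%d observations"
msgstr[0] "%d observação"
msgstr[1] "%d observações"
```

Compiling is then a matter of something like `msgfmt -o R-pt.mo R-pt.po`, with the .mo placed where R looks for it at runtime.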
So there are messages pulled from a bunch of different places, but the mechanics are the same: messages in C are marked in a certain way and pulled into this .pot template file, and once you have the template file, everything follows the same from there. For the templated strings, does one need to use sprintf, or could you use something like glue, which is also templated after a certain fashion? Yeah — I'm still working out how to extend translations to more modern ways of building strings. In principle, what you need is to pass the literal string to be translated first, and then glue can apply the evaluation part and proceed from there. So if you had something that takes the input to glue and extracts it into a .pot file, then it can be translated and passed on to glue; glue does the evaluation and things proceed. As of now, it's a bit difficult to get that to work. You would basically have to write a couple of wrapper functions — a wrapper function that says: this string gets passed on to glue, translated, and then evaluated. In base R, of course, glue isn't used, so we don't have to worry about that too much today. But for doing this in your own package, you would, I think, want to write a wrapper function — you might call it glueF, or glue_translated, or something — such that in your source code you're using this wrapper function, and the wrapper contains the strings you want translated. And then the tooling would be able to look, in your package, for those strings being sent to custom functions and extract them to the .pot template. I don't think I explained that super well, but the idea is: a bit complicated, but doable. And you have to be careful, because what's going to happen is you're going to have the string.
It's going to be translated, and then glue is going to act on it. So if the translation breaks the evaluability of that message, then glue is going to error on it. I think it's doable; it's something that hopefully we can work out together in the future. It's something I'm starting to look into now, but haven't actually done in practice. Yeah, it's interesting to think about, because my understanding is that with glue you're referencing symbols in your code from that message. In that case, if you were to change that symbol name, then you're probably going to have to update all the translations — is that right? Yeah, I think so, but I think it's kind of the same for sprintf, right, when something changes? It's more serious for glue, I think, but something similar can happen in sprintf: if you change the type from an integer to a string, the message template has to update. But I think it's more common to rename variables and refactor code — the message is basically the same, but the internal variable name changes, and that will break your translation. You'll have to redo the translation, but hopefully the maintenance burden is not too high, because you would just copy and paste the thing inside the curly braces into the new translation. But yes, it would definitely break the translation. Actually, I've been using glue with PO files, and it has been much more convenient compared to sprintf, because with sprintf you reference variables by index, and when you switch the index in a language, it just messes up the translation altogether. Whereas with glue — although you do have to ask your translators to keep the English variable names — you can switch the order of the variables as they think looks better in a given language. Yes, we'll talk about that; I think what you're really describing, we'll get to in a couple of minutes.
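The wrapper-function idea might be sketched like this (tr_glue is a hypothetical name, and the extraction tooling would separately need to be taught to pull its first argument into the .pot file):

```r
library(glue)

# Translate the literal template first, then let glue interpolate.
# Translators must keep the {variable} names intact, but may move them
# around freely within the translated string.
tr_glue <- function(msg, .envir = parent.frame()) {
  glue(gettext(msg), .envir = .envir)
}

n_bad <- 3
tr_glue("removed {n_bad} rows with missing values")
```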
But yeah, there's this very ugly thing you have to do with sprintf when you want to change the order — when it's much more natural in many languages to write the message in a different order. The way you have to handle that in sprintf is quite ugly, whereas with glue it would just be a matter of writing the variable name in a different place. Yeah, it's a good point. I guess another advantage would be that when you're going through the translations and just looking at the message strings, it might be more obvious, more self-descriptive, what is actually going to go into that %s, and that might help you translate it, right? Assuming the variable is informatively named. Which is another thing we'll talk about in a bit: sometimes it's hard to reconstruct the context of a message without going back to the source code, and it becomes more self-documenting in the .pot file itself when you're using glue. So, one question I've been wondering about — I've never tried translating myself — but it seems like situations would arise where, when using sprintf or glue and substituting in a word at runtime, in some languages that noun could have a gender, or something about it that affects the surrounding words. Like, "this is a bad argument" — but "argument" might be substituted out for a noun of a different gender, and now "bad" might need to be formed differently. Does that ever arise? And how would you handle it? Is there a better way for us to write error messages in English so those cases don't come up? Yeah — what I've run into so far is that what you want to plug into the template parts of a sprintf message are not English words; you basically want to avoid putting English words in there.
Because if you do that — and gender is not the only thing — the case and the arrangement of the words might change depending on what you put in there. If you're putting only variable names or function names, that kind of thing, inside the template, that makes it, I think, more translatable. Right — it's almost like it should just be something that is part of the data, something that would be relevant only to the user, and that the user will understand whatever it is: something from whatever data set they're using, or the context in which they're working. Yeah. And any other sort of trickery — you can imagine people trying to be clever: they're conditioning on something and they don't want to write the message out twice, so they try to be clever, and all of a sudden that breaks translations. Yep. We actually have a PR in data.table right now where someone has done exactly that: there are maybe 15 total combinatorial error messages that can happen, built from three to five components, and they say: take component one, paste component two; take component one, paste component three — based on what the actual error is. But when it comes to translation, that becomes much tougher. Very cool, thanks. All right, let's continue. Okay. So that's the mechanics of how it works. So you've done translations — now what? Now we talk about how to maintain translations over time. There are three basic things that happen to translations over time. New error messages get added: new code gets added, or you found a bug and you need to catch another edge case, so you add a new error message.
And old messages get removed: either you deprecated code, or you refactored code, or maybe you even moved the code from R to C, and so a message disappears from the R .pot file and ends up in the C .pot file. And messages get changed. Ideally the messages would be so instructive and informative that everybody who ever hits an error knows exactly what happened, but obviously we're quite far from that ideal. So as we get user feedback over time, we think of ways to improve error messages: they get changed or refactored, maybe there's a typo, et cetera — the messages change just slightly. Now, the .pot file is always extracted directly from the source code, so the .pot file is super easy to change: you just overwrite the old one. But the .po file is less so, because you don't want to just erase all the old translations. Ideally, you keep all the translations that stay the same and figure out what to do about the deltas. For new messages, again, it's pretty easy: you can just drop the new message ID into the .po file; it should be easy to insert. Removed messages — well, maybe the message didn't really get removed, it just moved to a slightly different place, and the old translation is still useful; it just doesn't fit exactly into the new thing. So the old translations don't just get deleted, which might throw some knowledge away; they get marked as deprecated, which means they show up in the .po file basically commented out, with a #~ marker. This example is from the actual R .po file for Chinese right now: a message that at some point appeared in R and no longer does. Somebody went to the effort of translating it, so in case it's still useful to translators in the future, it's left there at the bottom of the file.
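A deprecated entry in a .po file looks like this (the entry itself is invented for illustration): the whole entry is commented out with #~, so gettext ignores it, but the old work is preserved:

```
#~ msgid "invalid file name"
#~ msgstr "nome de arquivo inválido"
```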
The next case is edited messages. These get marked as "fuzzy", as they call it. If an error message changes slightly, the translation doesn't get thrown away entirely: gettext tries to find a close match. I'm still quite puzzled about what exactly the matching algorithm is, because it doesn't seem to be too great, but it looks for close messages, and if a close one is found, it will merge the old translation with the new message and mark the entry as fuzzy. So as you update your translations, you can look for the fuzzy ones and make sure they're okay, or update them slightly to match the new message. Here's one marked as fuzzy in Chinese. You probably can't read it, but this part actually does say "should be numeric" — however, the argument this message is about is apparently no longer named tolerance; it's now x. So the message has changed slightly; we would just have to change tolerance to x, and this is a valid translation again. The idea of fuzzy translations is to make things slightly easier on you by associating old translations, where the work is already done, with close matches among the new messages, to minimize the burden of re-translation. Quick clarifying question on that: the message ID is what the current message is, right? So the message ID has been updated, and the msgstr is what has to be updated — tolerance has to be changed to x? Yes, exactly: tolerance needs to change to x here. When you come across a fuzzy message in a .po file, the message ID is correct and the msgstr might be wrong. The message ID always reflects what is currently in the .pot file, and the msgstr might need to be updated. Thanks. Yeah, thanks for helping to clarify.
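A fuzzy entry looks like this (invented for illustration, with a Portuguese stand-in for the Chinese example above): the msgid is current, while the msgstr was carried over from the old "tolerance" message. Fixing it means updating the msgstr and deleting the "#, fuzzy" line:

```
#, fuzzy
msgid "'x' should be numeric"
msgstr "'tolerance' deve ser numérico"
```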
Can I ask another question? Yeah, sure. So the idea is that at some point in the past, there was a message ID that said "x should be numeric", right? Yeah, exactly — the old message ID. And is that message ID still in the code somewhere but deprecated? Or, when a fuzzy match is found, is it completely removed and overwritten? I'm not 100% sure I understand, so let me clarify. I don't think it gets marked as deprecated — instead of being marked as deprecated, it gets marked as fuzzy. So a message either continues to exist, or gets marked as deprecated, or gets marked as fuzzy; I think there are three options there. What that would mean is that this "x should be numeric" message has disappeared from the database, and because it disappeared, gettext looked for a close match among the still-translated messages that remained. Does that make sense? Yeah — it seems odd that such a common-looking message would have disappeared, but who knows: maybe it wound up in C, or they added a period; it could have changed in any number of ways. But "x should be numeric" is no longer there. Does that help? Yeah, it's just kind of weird that you're then basically losing the old translation. If the fuzzy match was wrong, then you're overriding something else — I don't know, it doesn't sound like such a good idea. Perhaps add it to the bottom and keep the old one? Yeah, I wonder if there might be an option in gettext to just shut off fuzzy matching and only mark old translations as deprecated. It makes sense to me that such an option exists, but I'm not familiar enough with the man page to say for sure that it does. And these automatic checks — the fuzzy matching, the removal — who does that? Is there an automatic tool that executes it? Yeah, the tool is called msgmerge — I'll just type that here: msgmerge. It's part of the gettext toolkit.
And what msgmerge does is take an existing .po file and merge it against an updated .pot file. There are presumably a ton of options associated with it, one of which probably dictates how matching of new messages is done. Okay, thanks. Yeah — there's a --no-fuzzy-matching flag. Perfect. So that would probably be more to Elio's preference: don't even try to do fuzzy matching, just throw old entries to the bottom of the file, and I'll sift through them myself and keep them in mind. I haven't worked extensively enough with this, but in my potools package I try to surface deprecated messages at the start of a translation session — spit them out to the translator so they're visible before you start, so you know what's available to you. And if a message is marked as fuzzy, then as you're translating, the old translation is put side by side: "this was previously translated as something like this" — maybe you just want to copy and paste. So, Mike, maybe I should know this, but does msgmerge get executed as part of the make step we run to update messages for every R release? I don't know about the makefile, but I know it's part of update_pkg_po, which is in tools, right? Right — so we run that via make, and so it does do a msgmerge. Yeah. I don't know exactly when; it should be after the .pot updates, and then it should run msgmerge for the old .po files. Yeah, right. So this is going to be introduced whether the translators like it or not: these fuzzy messages will appear when we update the messages. So it's all the more important to have people look at things before release to make sure they're good. Yeah.
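The merge step itself is a single command; the sketch below assumes GNU gettext is installed and that R-pt.po / R-base.pot are the catalog and freshly extracted template being reconciled:

```
# merge the existing translations against the updated template in place;
# unchanged translations are kept, near-matches are marked "#, fuzzy",
# and vanished messages are pushed to the bottom as "#~" entries
msgmerge --update R-pt.po R-base.pot

# the behavior preferred in the discussion above: skip fuzzy matching
msgmerge --update --no-fuzzy-matching R-pt.po R-base.pot
```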
Maybe it's a discussion to have whether R itself should add the --no-fuzzy-matching flag, but so far I guess we haven't heard too many complaints about it — though, for example, the Spanish messages haven't been updated in 10 years, so maybe it's one of those things that would just make it easier for people. There's definitely room for improvement in terms of the translations. Speaking of things outdated for 10 years: do you folks happen to know what happened to Pootle — I'm not sure how to pronounce it — the service where you could edit PO files through a web interface at translation.r-project.org? It was live about 10 years ago, but it's not accessible anymore. That's the first time I've heard of this. If Michael has never heard of it, then maybe Brian had a server doing this at some point. Right — he must have had something set up. But that's exactly the kind of thing that the working group we're starting up is going to be looking into: figuring out how we can streamline the translation submission process and so on. Yeah. Maybe I'll follow up with you about what the experience with that was like; it'd be great to learn about. Let me write that down quickly so I don't lose it from the chat. I can also share it later — I just remember this service running 10 years ago, and when I checked yesterday it was gone. Okay, I'll continue. So, we talked about what the maintenance burden of translations looks like; now a couple more details about locales and domains. Most locales are two-letter ISO codes. By now you're probably familiar with your own language's code, if you've used your native language on your computer enough — Indonesian, for example, is id; Turkish is tr. Then there are also dialects, and the question of how to handle dialects in translation.
The most common case I've come across is Chinese, which, as you may know, has Simplified and Traditional. The way translation systems usually handle that is to use zh_CN for Simplified Chinese — really mainland Chinese, where Simplified is used the most — and zh_TW for Chinese in Taiwan, where Traditional is used the most. Presumably there could also be a zh_HK, Hong Kong Chinese, which would also be Traditional Chinese — though the dialect is substantially different there, so the messages might actually read quite differently. For now, I've only ever seen this as Simplified versus Traditional: Simplified is zh_CN and Traditional is zh_TW. For some of the other huge diasporic languages in the world, actual dialects are more common — like Spanish: there's Spanish for Argentina, es_AR; Spanish for the Dominican Republic, es_DO; Spanish for Spain, es_ES. And Arabic, too, has a ton of dialects: there's a Tunisian subdomain, one for Saudi Arabia, one for Syria. So those huge diasporic languages have some scope for dialect translations as well. So let's talk about how that actually works. Suppose you are in Argentina, running in es_AR — which message actually gets shown to you? When you hit stop("You are not supposed to be here. You've done something wrong."), what happens is: first, the message gets looked up in the es_AR-specific .mo file, the one most specific to your current session. If there's no translation for the message in the es_AR .mo file, we go to the less specific version, the general es .mo file. And if none is found there, the message just ends up untranslated: whatever the original message in the source code was, it gets reproduced.
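Sketched as a lookup chain (paths are illustrative; the compiled catalogs live under the library/translations directory of the R installation):

```r
Sys.setenv(LANGUAGE = "es_AR")
# For each message, gettext walks:
#   1. translations/es_AR/LC_MESSAGES/...  -- dialect-specific catalog
#   2. translations/es/LC_MESSAGES/...     -- general Spanish catalog
#   3. no match anywhere -> the source string is returned as-is
gettext("You are not supposed to be here.")
# no catalog contains this made-up message, so it falls through all
# three steps and echoes back untranslated
```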
For the vast majority of cases that will just be English, but in principle nothing stops you from writing all of your messages in, say, German in your source code, and then the German message would be reproduced. That's basically how it works. There's a lot more complication available that I don't think 99.9% of people use, but that's the basic idea. R itself uses this to do the British/American English split: anywhere there's a spelling difference — it's mainly about spelling, like grey/gray and favour/favor, though I guess there are a couple of dialect things too — there's an en_GB .mo file, and things get looked up there first; if nothing is found, the general English message is returned. And lastly, this is why there are multiple .pot files, which we'll start seeing soon: they're split by where the message came from, because the tooling that creates the .pot file differs in the first place. So there's one .pot that corresponds to the R files and one that corresponds to the C files. I think we talked about this a little bit already, so I'll go over it again. Okay — another pause for questions, in case any have accumulated. Where are those files? I'm trying to find them in the source code I downloaded from SVN. Yeah, let me share a different tab — I'm just going to browse on GitHub real quick. The .po and .pot files are in each package's po directory under src/library, and they're all here. Here is Farsi, for example; here's Brazilian Portuguese. The template for base's R-level messages is R-base.pot, and the template for base's C messages is called R.pot. And all of the default packages have their own po directory, like stats: here's the stats po directory — there's R-stats.pot, and then there's stats.pot. And then — this is going to be the fun part — okay, let me see if I can share. It's probably too small for you, huh? Can you see that now? No? It's a bit small, maybe.
All right. So for a running R session, it's this library/translations directory that the translations end up in — these are all the .mo files. So on an actual R installation, here are all the .mo files that ship for all the default packages. I even found a Farsi one. So that's the mechanics of translation — that's how to think about translations as a developer. Ideally, as a translator, a lot of that friction would go away; that's my goal for the potools package: to get rid of a lot of that required knowledge, because all this gettext machinery is, in my opinion, a lot of overhead if you just want to translate things. Focusing on translation itself, there's still a whole other set of institutional knowledge to start thinking about: how to make translation easier, how to make translation work better. So we'll go over a couple of things — there's a lot more, of course, but a few big ones that I've certainly come across. One is templated translations, i.e. sprintf. We talked about this a bit before. You might see a message — this is an actual message from R — "%d arguments passed to .Internal(%s) which requires %d". The %d means a number, an integer; the %s is a string which is going to name some internal function. So you ran into an error where you passed an internal function the wrong number of arguments. The translation you write for this has to have %d, %s, and %d in that order — or, if you have to rearrange them, which happens in a lot of languages that have different syntactic structures than English, you have to number the inputs. Here's how this actually looks in the Chinese translation of this message.
We see that .Internal is still getting passed the string, and it's the second input: that's what the %2$ in %2$s means. The first input is still a digit, and the %1$ says it's the first input. Then %3$ says that one is the third input. So this is what we were talking about with glue earlier, where the glue version of this would just be {internal_name}, {received_args}, {required_args}, and you could simply rearrange those templates within the translated string. But for sprintf, you have to bring along these %2$, %1$ redirects and refer to the argument order as the arguments are received in the code itself. So there's this extra overhead of keeping track of the order of inputs as you translate into a more natural order for your language. I guess that if you're using the glue idea, the problem there could be that you lose whether it's an integer or a number or a string, which could be useful for translation. So I think the idea is that the glue version refers to a variable, and as the type of the variable changes, the behavior of glue dictates what happens. So it's kind of independent of translation: you don't have to keep track in glue of what type the variable is; glue does that for you. Here, with sprintf, there are two things you have to keep track of, which is both the types of the variables and their order. Yeah, but for translators, maybe knowing matters. For example, if you get a string that says {arguments} passed to, you don't know if that's a string with the names of the arguments or the number of arguments, which might be important for translating. I think you need to provide some comments for the translators anyway, if it's a complex message to be translated. At least that has been my experience.
It's really great to use self-explanatory variable names, but at some point that's just not enough. And gettext provides ways to pass comments to the translators, with the "#." prefix, which is, I think, really useful in such cases. Yeah, that's something the R tooling doesn't have access to at all right now: those context specifiers. That's something I think would be nice to start integrating into R, figuring out how to pull these comments along so they show up not just in the source code but also in the .pot template, if you want to leave a context clue for your translator. So that's in the gettext tooling, but it's not integrated with R itself yet. Good question for Mike, on the templated strings here. You showed the example where the order of things changes because of Chinese syntax. Do we have to use these indexed prefixes, if I can put it that way, in the original language, or is it sufficient that any translation that changes the order of appearance of the arguments uses them? Yeah, so there's a couple of things. One, you can get away with no redirects at all, no %1$, %2$, in the original message, and then only the translators have to worry about what the order is; that's perfectly valid. Two, technically you could just write %1$d, %2$s, %3$d in the original. That's perfectly valid R and sprintf code. I had no idea about this until quite recently, but sometimes it can be quite useful: when you're writing a sprintf call and want to reuse an input, instead of writing %s, %s, %s and passing the same argument repeatedly, you just write %1$s, %1$s, %1$s in your actual code and reuse it. So the redirects are valid in the original messages, but yeah, it's very atypical.
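The redirect behavior just described can be demonstrated directly in R's sprintf; the format strings and arguments here are illustrative, not the actual R source messages.

```r
# English ordering, no redirects needed:
original <- sprintf("%d arguments passed to '%s' which requires %d",
                    2L, "f", 3L)

# A translation that reorders the slots must carry numbered redirects,
# since %1$, %2$, %3$ refer back to the original argument positions:
reordered <- sprintf("'%2$s' requires %3$d arguments but received %1$d",
                     2L, "f", 3L)

# Redirects also let you reuse a single input several times:
reused <- sprintf("%1$s, %1$s, and %1$s again", "echo")

cat(original, reordered, reused, sep = "\n")
```

Note that the arguments (2L, "f", 3L) stay in the same order in both calls; only the format string changes, which is exactly what a translator edits.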
I don't think R uses that a single time in the source code, if I remember correctly. So it's quite uncommon to actually do that in the English code, and I'm not sure it would even help the translators. So, next: plural translations. I'm not sure how common it is to use ngettext in the first place, but it is there. Here's an example: "there is no package called" versus "there are no packages called". This is an error R produces when you try to load packages that don't exist, depending on how many nonexistent packages you tried to load. If you only tried to load one, it says "there is no", and n would be one; "there are no" is for n being any number other than one. The singular goes in as the msgid and the plural as the msgid_plural. We talked a little bit about how that shows up in English, and this is how the message appears in the .pot file. But for translations, lots of languages have lots of different approaches to pluralization. The languages I'm more or less familiar with are all either like English or East Asian languages, and East Asian languages really have no plural. So there's this Plural-Forms metadata parameter for each language that dictates what the pluralization structure is. For the East Asian languages, there would be only one message here: regardless of how many packages failed to load, the translation is the same, because there's just no real grammatical pluralization in Chinese or Japanese. And the Romance languages, Spanish and Portuguese, along with English, all use the same basic pluralization structure, where it's either singular or it's not. But there are other languages which are much more complicated, mind-blowingly complicated. In Arabic, there are six different plural forms.
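A minimal sketch of the ngettext usage being described; the wrapper function and messages here are illustrative, not R's actual implementation of the package-loading error.

```r
# ngettext() picks the singular or plural message based on n. In an
# untranslated session it applies the English rule (n == 1 vs. otherwise);
# with a translation installed, the target language's Plural-Forms rule
# decides which msgstr index is used.
load_error <- function(pkgs) {
  n <- length(pkgs)
  sprintf(ngettext(n,
                   "there is no package called %s",
                   "there are no packages called %s"),
          paste(pkgs, collapse = ", "))
}

cat(load_error("notapkg"), "\n")                # singular branch
cat(load_error(c("notapkg", "alsonot")), "\n")  # plural branch
```

The key point is that both English forms live in the source code, so the .pot file carries a msgid/msgid_plural pair rather than a single string.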
I'm not sure how cognizant our Arabic speakers even are of that being the case, or if it's just second nature to them: as you speak, you do this pluralization without even thinking about it. I think Polish also has five different forms. Yeah, so there's basically a Plural-Forms parameter, and I can show it to you later, but it dictates, based on what n is, which pluralization to use. They go in order, and you just have to know that when n equals this, it corresponds to this message, based on what index the parameter evaluates to. I can talk about that more later with the people for whom pluralization is more complicated in their language, which I think would only be Arabic on this call. Yeah, maybe I can just show this; I think we're doing fine on time, so I'll take a pause to show it quickly. So GNOME, the Linux desktop project, has a really good ecosystem for translation going on. I refer to this site all the time. Their Arabic team shows some stats about how translation works, and at the bottom here you see this really complicated formula that dictates how pluralization works in Arabic. There are six plurals, and you apply this arithmetic to the input n, and whatever the output is, that's the index of the translation you use. Compare that to English, for example, where it's as simple as n != 1. So, a couple more things: special characters and newlines. If the msgid, which is the original message, starts with a newline, the translation has to start with a newline. If it ends with a newline, the translation has to end with a newline. Newlines in the middle, I think you can kind of do whatever you want with; they don't even have to match in number, although they probably should. Tabs, you can do whatever you want; gettext doesn't seem to care whether you match tabs or not.
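For concreteness, these are the standard gettext Plural-Forms header values for the cases mentioned above; they go in the .po file's metadata header, and the expression's result is the msgstr index used for a given n.

```po
# East Asian languages (e.g. Chinese, Japanese): one form, always index 0
"Plural-Forms: nplurals=1; plural=0;\n"

# English and most Romance languages: two forms, plural unless n == 1
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

# Arabic: six forms, selected by arithmetic on n
"Plural-Forms: nplurals=6; plural=(n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : "
"n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5);\n"
```

Translators rarely write these by hand; tools like msginit fill in the conventional rule for the chosen language.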
And certain other things, like a carriage return or this vertical tab, I think aren't allowed in translatable strings to begin with, so you shouldn't come across them. But if you're trying to translate a package and a message contains one of those, marking it for translation isn't going to work. Yeah, punctuation: all languages have slightly different punctuation, different ways of quoting, different brackets. At the end of the day, the translations you're doing are communicating to your users, so use whatever you think is most natural to them. Try to be consistent; I think that's the tougher part. And if you're going to use the ASCII double quote, you have to escape it. This is just to call out, if you're not super familiar with C or Python and their ability to concatenate strings across lines by just writing adjacent quotes: that's how it works in the .pot and .po files as well. As long as you keep writing quoted strings line after line, they'll all be considered one message. So here, everything between quotes on all these lines gets concatenated directly, with no spaces added, and it's just one string. This is how they avoid massive strings dozens of pages wide on a single line. This is the longest message in the whole R message database, and it's one I think we're all familiar with: it's the one that shows up when you first launch R. It's scary to show people, I think. Here, gettext also breaks messages at each newline, so every time there's a newline, it wraps to a new line. Coming to technical terms, like "mean" as a statistical concept or "namespace" as a computing concept, these you can often just leave untranslated if they're technical enough.
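A small illustration of the line-by-line concatenation just described; the wording is an abbreviated stand-in for the real R startup message, and the Spanish rendering is illustrative.

```po
msgid ""
"R is free software and comes with ABSOLUTELY NO WARRANTY.\n"
"You are welcome to redistribute it under certain conditions.\n"
msgstr ""
"R es software libre y viene sin GARANTIA ALGUNA.\n"
"Usted puede redistribuirlo bajo ciertas condiciones.\n"
```

Because the quoted pieces are joined with no separator, any needed spaces or newlines must be written explicitly inside the quotes (here as the trailing \n on each line).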
But if you do want to translate them, there is a glossary of technical and statistical terms that is referenced by the R translations manual. My go-to is Wikipedia: find the Wikipedia page for "namespace", go to the other languages, see if it's translated into your language, and copy and paste from there. One thing the data.table team did that was useful was to start with a glossary. Basically, if you're translating as a group of people, you have a reference list of translations for the more technical things, so that as you come across them, you agree on what the translation is, mark it down, and refer back to it in the future. Now, as you're doing translation, what's going to happen is you'll be in the .po file and there's going to be a string, and you might have no idea what the string is talking about, because of these template placeholders. You have no idea what's trying to be communicated. So how do we recover that context? Gergely mentioned that gettext does come with a way to provide hints to translators, but R is not equipped for that yet. So these things are kind of just strings in the void that you have to figure out how to translate. This isn't a great example, because it's fairly clear, but you can find any number of strings within the .pot files for R where it's like: I have no idea what this is talking about. How do we figure out what that context is? We do get these "#:" comments that show where in the source the message came from. So if we go to this file at line 826, we will find this error message, or warning; it looks like an error message. And this message is repeated a bunch of times: it also shows up at src/main/util.c line 859, at src/main/util.c line 900, and all these other places. So this message is used a bunch of different times, and we can go back to the source code and work out from reading it what context it was used in.
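To make the entry anatomy concrete: the "#:" source references below are the ones cited above, while the msgid and the "#." translator comment are hypothetical, since as discussed R's extraction tooling does not emit "#." comments yet.

```po
#. hypothetical translator comment (gettext supports these;
#. R's tooling does not extract them yet)
#: src/main/util.c:826 src/main/util.c:859 src/main/util.c:900
#, c-format
msgid "example message with a '%s' template"
msgstr ""
```

An empty msgstr means "not yet translated"; the "#, c-format" flag is what tells msgfmt to check that the translation's % templates match the original's.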
Hopefully that's helpful. Are there any editors or IDEs that would make it easy to jump to that place in the file from the .po file? Maybe. I know that Emacs has a po-mode. I haven't been able to figure out how to get it to work yet, because I'm very weak at Emacs to begin with. So it might work with Emacs. Although, whatever the tool is, it would have to know what this file name is relative to: this .po file lives in the src/library/base/po directory, but the "src/..." reference is relative to the top-level directory of R, so the path isn't correct relative to the .po file itself. We'd have to give the tool a bit more context about what the file name is relative to. The translation guide for R also mentions, I'd have to look it up real quick, another tool that might have that functionality, but I'm not sure. Yeah, I was thinking maybe later on, as part of this working group or whatever, we could convince RStudio to hard-code some conveniences for this, to facilitate translation, so you can find the context in the code. KBabel, I don't know if anybody's familiar with KBabel, is the other tool mentioned in the R translations guide, which might help, but I haven't used either of them very successfully. So for me, when I'm doing this, I just have Atom open with the directory and poke around in the source tree. It's usually not too hard to find, but it would be nice, because these files are huge, right? Getting to line 900 is a scroll, scroll, scroll kind of thing. Mike, is there any way to embed comments in here? I mean, presumably, given the way the .po files and the .pot file are generated, probably not. But is there a way to leave, for posterity, a comment about the context that might aid translation in the future? Yeah, I think Gergely is actually quite familiar with this.
He said he's used this before. gettext itself is equipped to do this; it's just not something that R is natively aware of in its build system yet. So I don't know whether, if you wrote those comments now, they would be erased by R's current build system. But in principle gettext has that functionality; I just don't know if R is equipped to handle it, and I think maybe that's something we should invest in going forward. But it's certainly available. Is it something that's embedded in the string itself, like some kind of comment that you embed in the string? No, but I think maybe Gergely should talk about this, because I'm also only somewhat familiar. So yeah, we'll ask Gergely. Yeah, so in the tools package there's the xgettext function, which is used, I think, even in base R to extract all the strings to be translated, and those are put together into a .pot file. This function doesn't have the option to look up comments. What I did for some of my packages is patch the xgettext function so that it not only extracts the string to be translated, but also handles the case where the function carrying the string, like stop, message, or warning, has another argument which can be a comment for the translator; it extracts that and puts it into the .pot file. Currently it's an ugly hack, and I hope there could be much better ways to solve this issue. The best way would probably be to do it in the tools package itself. Okay, so it would require a change to the extraction; so we'd be pulling out, say, comments above the call to gettext or something. Yeah, one can imagine a roxygen-type comment markup that gets recognized as translator comments: if they're next to a warning or stop call, they get associated with it and extracted by the tool. Yeah, I mean, ultimately we should be writing code that is so clear that it would be best to just read the code, right?
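A minimal sketch of what the stock tools::xgettext extraction does, run against a throwaway package directory; the package name, file, and messages here are made up, and this only shows the string extraction, not the comment-extraction patch being discussed.

```r
# Build a tiny fake package source tree in tempdir():
pkg <- file.path(tempdir(), "demopkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(c(
  'f <- function() stop("our princess is in another castle")',
  'g <- function() warning("something mildly bad happened")'
), file.path(pkg, "R", "msgs.R"))

# xgettext() scans the R/ directory for literal strings inside
# stop()/warning()/message()/gettext() calls. asCall = FALSE asks for
# just the strings rather than the whole deparsed calls:
found <- tools::xgettext(pkg, asCall = FALSE)
print(found)
```

These extracted strings are what end up as msgid entries when the .pot file is assembled.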
I mean, as coders, that's what we're trying to do. Comments end up creating cruft, as everybody knows, as the comment and the code get out of sync. So ideally we figure out a way to point people into the code, and have the code be obvious enough about what it's doing. That would be the ideal, but comments will probably help a little bit. I'll say, with data.table, we had about 1,400 messages, and there were maybe 10 to 20 of them where the translators said: I have no idea what this is. And when they told me, I realized I also had no idea what it was, which goes back to: maybe we need to refactor the code itself to be better. So yeah, that's another side benefit of doing translations: this interplay of thinking about your error-message database as something you need to actively translate actually improves your messages, by feeding back that, well, this was a terrible message in the first place; we should improve the English version of it. Before we can even translate it, it needs to be better in the first place. Yeah. And there have been translators who have done that in the past, and it's actually been really helpful. So I'd encourage everybody: as you're contributing these things, feedback is very welcome. If you don't think you understand what something means in English, or you're not sure the message is even being emitted properly, those types of things, we can catch them. Yeah, I agree. And one more comment I had was that potools goes slightly in this direction already, in that when I show the message, instead of just the string, I at least include the whole call that produced it. So it would include the stop or the warning and the gettext on the R side.
And on the C side it would include the same, so you get to see all the arguments for the templated strings as well. It helps a little bit, but it would certainly be better to just click and jump to the full source code immediately. Just one more way to find the context is to grep the source code; sometimes that helps too, especially on the R side. So, given that there are these constraints, that the templates have to match and you have to escape your double quotes and match your newlines, mistakes just creep in sometimes. You miss a certain template, or you switched to a different keyboard and the percent sign was not the ASCII percent sign but the percent sign from another script, and now the format doesn't get recognized because you typed, say, the full-width Chinese percent sign and then went back to English. These kinds of things just creep in. So one way to check validity is with the command-line tool msgfmt, which is short for "message format". This is the tool that's used to create .mo files, but you can just throw away the output and run it, and it will report errors if there are any. (Mike, are you advancing the slides? I'm still seeing the one about grep. Oh, let me force-refresh. Are we seeing the msgfmt slide? Still the same. Maybe you're sharing the other window. Yeah, let me stop sharing; I think I had more than one tab open. Okay, hopefully you can see it now, showing msgfmt. Can you? Yes.) So, just to show an example invocation of msgfmt: this -c flag gives you better feedback; your input is the .po file; and I just discard the compiled output, so hopefully there are no errors and nothing else is produced.
So if you want to actually get the messages to appear in an R session, that's a lot more involved. If you have the SVN copy of R-devel, you should be able to just build as normal and hopefully the translations show up just like that. You have to have the .po file in the right place in the R source directory, but once it's there, it should install correctly. My way to get the right language to appear is just to set the LANGUAGE environment variable temporarily before starting R. This is different on Windows. I haven't been able to find a consistent and sticky way to make this work on Windows, but the basic idea is the same, that LANGUAGE controls it; I've just had no luck on Windows getting it to work. I'm sorry, I'm just not familiar enough with how translations work on Windows. On the testing piece, I'll just mention: make sure you take advantage of this utility and test the translations prior to submitting them, because otherwise there will just be a lot of back and forth as I try to get things to build. Speaking from experience, I take it. Yeah, so just making sure that there are no errors, very simple stuff. And in my experience, well, maybe I'm just staring at this stuff too much already, I don't know, but I do think the error messages you get back from msgfmt are useful enough to be actionable. It's not something that gives you total gibberish; it gives you line numbers and says what went wrong. And if you're familiar enough with these common pitfalls of translation, it usually nails the problem down for you right away. Yeah, so we didn't get to use potools today, because the scope of the workshop, I think, was a bit different. But in general, I think potools will already be helpful for doing translations for your own packages. The big blocker to using it today is that, as we saw, base has too many messages, so there's no way we're going to do it today in one shot.
And in general, it's not meant to be done in one shot, but potools is not super well-equipped for doing chunks of translation yet: picking out a certain number of messages to translate, or focusing on a certain subset of messages and working with them. So, yeah, use it for your own package. It has a lot of tools for treating the messages in your package like a database and analyzing them as such, like diagnosing whether there are typos. I think it's just helpful to have the mindset of considering your package's message database, and potools should facilitate working with it. Okay. That wraps up the discussion of how translations work. I think we have until, what, another hour and 45 minutes, right? So we'll take about a 15-minute break: rest, stretch your legs, stretch anything you want, take a bathroom break, get some fluids, and we'll come back. I'll stay here for another couple of minutes for any questions that come up, and then, yeah, I'll go stretch myself. Is there any way to build R from source with the new .po files without all the complexity of building R? I mean, is there some kind of shortcut? Yeah, so I was just thinking about this today, and I think you can do it by generating the .mo file yourself and injecting it into where the .mo files already live. I can't promise that it won't break anything, but I think it would be a way to get away with not having to fully rebuild R and still see whether the translation shows up like you'd like. I think it should be relatively safe, because my understanding is that the .mo is really just a lookup table. So if you break things, it shouldn't cause a segfault or anything; it really should just cause the lookup to fail, and then the untranslated message gets produced. So in the case where your translations are actually working correctly, it can confirm that they worked correctly.
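A sketch of the injection idea under discussion: computing where a hand-compiled .mo would go in an existing R installation. The language code and file name ("es", "R-base.mo") are illustrative assumptions; this just builds the path, it doesn't copy anything.

```r
# Translations for base R live under R_HOME/library/translations in an
# installed R, organized by language and message domain:
lang <- "es"
target <- file.path(R.home("library"), "translations",
                    lang, "LC_MESSAGES", "R-base.mo")
print(target)

# The .mo itself would come from something like:
#   msgfmt -o R-base.mo R-es.po
# followed by file.copy("R-base.mo", target), and then starting R with
# the LANGUAGE environment variable set, e.g.  LANGUAGE=es R
```

Because failed lookups fall back to the original English, a broken catalog placed here should degrade gracefully rather than crash the session.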
What I can't promise is that it will tell you whether what failed is your translations or the injection approach itself; whether that works as a way around rebuilding R, I can't say off the top of my head. Yeah, and if you're submitting translations that are useful for us, they're going to be against the latest R-devel anyway, and the later the better, right? So ideally you're already getting it out of SVN and building it. There probably are hacks to get around it, but I think ultimately, to do it the right way, you're working on a checkout of R and going through that complexity. There is actually another workshop tomorrow on how to build and contribute to R, so you can check that one out. Yeah, for me, I already have R checked out, and once you have R and it's been made once, the remake is usually pretty fast; just adding translations probably would not take too long to compile. But the learning curve to get to the place where it actually makes correctly in the first place is pretty substantial. The other thing I was thinking about this morning was that, in principle, you could just git clone the git mirror of R-devel and do your translations there; you never really have to actually make R. And if you're using msgfmt correctly to check that the translations file is formatted correctly, that could get you away with not having to actually install R from source, I think. Just as a way to lower the barrier to doing translations, I think that should be possible. Okay. And did you share the slides somewhere? Oh, sorry, let me put that in the Slack; I meant to put that in the Slack. Oh, can you stop presenting? It will be easier to copy the command then. Thanks. I meant to share these ages ago. Sorry.
Let me know if I messed up any of the permissions, which is always my bug there. Seems to work. Okay. So it'll be slightly different for the Spanish crew and the rest, but the basic idea is to do a couple of translations here. For Spanish it's a little different because there are existing translations. For the rest, I compiled a small set of famous, and infamous, messages that are maybe the most common or best known from R, and we can translate those, since they're the most salient. For Spanish, there are C translations but no R translations, so you'll do half and half: updating old translations and making new translations for the R side. So doing those translations is one thing, and then hopefully we can get to a place where this is ready to be patched into R. Mike Lawrence is on R-Core and he's enthusiastic about that possibility, so I think it's something we can get to. You have these slides now, so I think these links will work for you. For the new languages: download this file, which is the extract from R-base, the R-side messages, and this R.pot, which is from the C side. So there are two files, each with about 50 messages in them. Then we will run msginit to create the initial .po file, update the metadata to make sure it's actually in the right form, and start doing translations. I don't think there's anyone here who actually ended up doing French. For Spanish, the translations were originally done, I think, around 2011. R is very old but very stable, so a lot of the messages still work, but there have of course been a lot of updates since then, and there are no translations for the R side yet. So we can do the common translations on the R side and just do updates on the C side. On the C side it will be this file; the old file is in some odd encoding, so I did an extract of it into UTF-8.
Hopefully we don't have to spend half the time fighting encodings, and this already works: this one is the "to do" file, the translations that need to be updated or are just new, and there are maybe 300 of those in here. And here's the UTF-8 version of the .po file. So yeah, if we get any progress on that, we'll be able to combine them with this, and we'll talk more about the details once we do the breakout. So everybody that's working on an individual language, which is everybody but the Spanish speakers, will stay in the main room, and we'll work on how to do a new language as described here. And the Spanish speakers will be in their own room, able to converse with one another, figure out how they want to work, and work in a slightly different way. So I'll set up the breakout rooms now. Mike, I have one technical question. Sure. Can I share my screen? Hopefully you can; unfortunately, I was booted from being host. Okay. Is it too small? I don't know how to make this bigger. So I'm using Poedit, and this entry here has, as the source text, the singular form and then the plural one. How do I type it in the translation part? Because when I press enter, it just makes a new line. This I don't know; I'm not sure. So in Bahasa Indonesia, I think there's only one plural form, so I think they maybe only expect one message here. Okay. Yeah, it should be okay. Just to wrap up quickly: thank you, everybody. First of all, I hope you learned some stuff, and, I guess more importantly, that this sets you up to be able to continue some work on translation in the future. I know Eleon and Paolo are talking about maybe doing a hackathon with useRs in Argentina to continue translation as a team. That would obviously reduce the burden a lot, being able to divide it over 10 or 20 people.
So maybe the Middle Eastern and Indonesian contingents can also brainstorm on how best to go about setting up a hackathon, and not have to carry all of the translations on your own shoulders. For next steps, you can either send me now what you have progress on, or, as you have time to work on it, send me what you come up with. What I want is the .po file for your language. At the top there's a metadata block, so if you scroll to the very top of the .po file, you should see the metadata: write your name as the Last-Translator, and make sure things are up to date there. There's one field about the encoding, the charset, and it should say UTF-8 there. And I think that's basically it. Try out the msgfmt command I left in the slides to check that the file is in the right form and syntactically okay. And yeah, of course, feel free to reach out to me if you're stuck anywhere; I'll help going forward as well. That's it. So send me the .po file over email, zipped up ideally, since zipping it helps avoid any encoding issues that can happen when you send it over email. I sent my email address in the Slack channel. Enjoy the rest of useR! Thank you.
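For reference, this is roughly what the metadata block at the top of a .po file looks like; every field value below is illustrative and should be replaced with your own details, with Last-Translator and the UTF-8 charset being the ones called out above.

```po
msgid ""
msgstr ""
"Project-Id-Version: R 4.2.0\n"
"Report-Msgid-Bugs-To: bugs.r-project.org\n"
"PO-Revision-Date: 2022-06-20 10:00-0300\n"
"Last-Translator: Your Name <you@example.com>\n"
"Language-Team: es\n"
"Language: es\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
```

Running msginit generates this block interactively, and msgfmt -c will flag header fields left at their default placeholder values.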