 One more talk before lunch, and I'm pleased to introduce Heather Rivers. She's from Yammer. She got her start doing Ruby at a startup weekend and had been doing Python and discovered the true way. Wow, you just throw a little antagonism in there and people love it. Okay, so Heather's going to talk to us about internationalizing your app using Ruby. So thank you, Heather. Thank you, Josh. Hi everyone. So yeah. So yeah, I'm going to be talking about crowdsourcing localization with Rails. If you've never paid much attention to it before, localization is a surprisingly tricky problem. So computers just aren't quite ready to handle all the complexities of natural language yet. We humans have had hundreds of thousands of years to get good at language and computers are still playing catch-up at this point. They probably will be for a little while to come. So we can't even figure out how to make them understand one human language, much less many of them. But sometimes we need computers to be able to serve content in different languages. So somehow we have to find a way to bridge the gap between their understanding of language and ours. So that's what I'm going to be talking about today. But, okay, before I do that, I just need to set expectations for this talk a little bit. A little nervous. All right. That's so much better. All right. So now on to some definitions. So a locale. In the context of translation, a locale is a combination of a language like Spanish and a country where it's spoken. And we generally represent it like this, just joined by an underscore. Internationalization, which is a long pain-in-the-ass word, which we sometimes abbreviate to I18N, because it's like the number of letters, is the process of abstracting anything that's locale-specific out of the software to prepare it to be translated. Localization, not quite as bad a word, abbreviated L10N is the process of actually providing those translations once you've already internationalized. 
So for the purpose of this talk, a token will refer to a placeholder in a translatable string. So, {friend} tagged you at {location}. I'll just use curly braces, but notation varies. So tokens might look simple, but they're actually the hardest part of internationalization. Let me give you an example. So let's say you're building a social app that allows users to buy and sell items from each other. I am going to get sued for this stupid screenshot. So let's say you start off letting people sell a small number of predetermined items. So you might display something like this in your activity feed. So this format won't always be grammatical, because in English, the article can be either "a" or "an" depending on context. So yeah, that's no good. So you might start to try to solve this by writing a method that detects whether the word begins with a vowel, but this actually won't work either, because there are just tons of exceptions to that rule. So then you might decide to just store the articles for each of the items, but you can probably see where this is going. This is also a terrible idea, because it only works while this is a closed set. What happens when you open it up and you let users enter whatever items they want? Suddenly you can't just store the right articles anymore. So you might come to the conclusion that you're going to have to get users to enter the articles themselves through some kind of form like this. And in this case, there's no real benefit to parsing what they provide, right? You might as well just store that as one string. So if we take it even a step further, if we end up storing them this way, it's going to be a lot easier to store them in other languages too, which have different rules for the articles. Like in French, it depends on the grammatical gender of the noun. So it's going to make it a lot better for French too. So the direction that we're going is a really important concept in translation. 
And what it boils down to is you should always store the highest level of representations of translatable texts that you possibly can, because otherwise you're going to end up projecting languages you happen to speak into your schema and then you're going to be locked into that and that's no good. Okay, so the other important takeaway here is that you should let your users help you. They already know the answers. So if you can find a way to extract them, your life is going to be a lot easier. So that's fine. We've temporarily solved this one little case in this one language, but our problems are going to get a lot bigger than that. Okay, so translation is conceptually simple, right? Without tokens. You just take a translatable string, you add a human brain with knowledge of language rules, produce a translation and store the results. Not too hard. So when you add tokens, things get a lot trickier. Suddenly, the translator can't apply the rules for every single possible value of that token. So now instead of just passing along some translated text, they also have to pass along some representation of the rules in their heads that they can be applied later. So somehow we have to figure out how to store those rules. And that's where things get interesting. All right, so you might be thinking localization sounds hard. And yes, it is. If it were easy, every application would be available in every language. But in fact, language alternatives are kind of rare in software. So even Facebook didn't start supporting a second language until it had well over 50 million users. So, yeah, I like this Red Bull and Wine thing he's got going on. And then a nice little fuck you. So, okay, so if they didn't start until 50 million users, why should you bother tackling it now, right? All right, this is the pie chart part. I'm sorry. 
According to a few recent surveys, English language speakers account for only about a quarter of total internet users and just barely ahead of native Chinese speakers. So there's a very sizable market out there that prefers to speak a language other than your default, even if it's English, which is, you know, the most common. So you might be thinking that there are a lot of people who speak your default language nonnatively, and that might be true. But according to a recent study, when people make decisions in a foreign language, their decisions tend to be less rooted in emotional reactions. So by making your users read a foreign language, you're limiting your ability to tap into their emotions. And that's really the only way to make people fall in love with your app. There are a lot of reasons that we delay internationalization. You might be afraid that it's just going to take a ton of developer time. And that's not necessarily the case. If you're not ready to devote a ton of time, there's still some simple options that will get you, you know, most of the way there. And we'll talk about this, but there are ways to do it without changing your application too much. You might also be afraid that it's going to cost a ton of money, you know, professional translations, that sounds scary and awful. But again, not necessarily. So there are a lot of services out there that that really take out the overhead that costs so much money and bring the price down to something manageable. But another way to do it is, you know, crowdsourcing translations from your users, which can get, be pretty much free. Yeah. So if you're starting to think that maybe there are some good reasons to internationalize, it turns out there's some parallel benefits too. So it forces you to design better software and to do things you should be doing anyway. So you guys, I don't need to tell you, you know, it's a good idea to keep your content and your code cleanly separated, right? 
So you don't want a lot of logic embedded in your templates. And here's how internationalization will help. So how is content transferred to a user? Well, first you have to store it somewhere, unless your app is just a random garbage generator. And then you have to select it. Somehow you have to decide which of the stuff that you have stored you need to pull for a request, right? Then you need to transform it sometimes, just, you know, change the data, prepare it. And then finally you need to actually present it, bundle it all up and send it to the browser. So let's use Quora as an example. They display texts like this on their site, like question promoted by a user to some number of people. So if this were a Rails app and they hadn't done any internationalization, their template might look something like this. So this technically works, but, unfortunately, this template file is responsible for an awful lot right now. So it's serving as the primary storage of the text that will be displayed to the user, as well as the implicit selection of that text, with embedded transformation rules, you know, capitalize and then the pluralization rules, and those are language-specific transformations. And then finally the presentation too, because this is ultimately all intended just to go straight to the browser. So it's not ideal. It's all in one file. So what internationalization does for you is it enforces this nice stratification. So suddenly your language content has to be stored separately, whether that's in YAML files, a database, or some external service. Selection is just a matter of locale detection and then pulling, you know, the appropriate translations depending on that locale. Transformation is just a matter of interpolating the data into the translations that you've pulled. And then presentation is a really simple matter of just delivering a user-facing template. 
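A minimal sketch of that stratification, in plain Ruby rather than Rails. The key name, the Spanish wording, and the helper names are assumptions made for illustration; the talk's actual slide code is not in the transcript.

```ruby
# Storage: translations live apart from the template, keyed by locale.
# (Hypothetical key and wording, loosely modeled on the example in the talk.)
TRANSLATIONS = {
  "en" => {
    "question.promoted" => {
      one:   "Question promoted by %{user} to %{count} person",
      other: "Question promoted by %{user} to %{count} people"
    }
  },
  "es" => {
    "question.promoted" => {
      one:   "Pregunta promocionada por %{user} a %{count} persona",
      other: "Pregunta promocionada por %{user} a %{count} personas"
    }
  }
}.freeze

# Selection: pick the entry for the requested locale, falling back to the default.
def select_entry(locale, key, default_locale = "en")
  table = TRANSLATIONS.fetch(locale) { TRANSLATIONS.fetch(default_locale) }
  table.fetch(key)
end

# Transformation: choose the plural form and interpolate the request's data.
# The one / not-one rule used here holds for English and Spanish only.
def transform(entry, user:, count:)
  form = count == 1 ? :one : :other
  format(entry.fetch(form), user: user, count: count)
end

# Presentation: the caller (the template) only states intent through the key.
def render_promoted(locale, user:, count:)
  transform(select_entry(locale, "question.promoted"), user: user, count: count)
end

puts render_promoted("es", user: "Ana", count: 3)
```

The point of the split is that adding a language touches only the storage table; the selection, transformation, and presentation code stay the same as long as the plural rule fits.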
So now in this scenario, all the crazy language idiosyncrasies are stored completely away from the template. They're totally isolated to the storage layer, so they're not going to, you know, cause a big mess in the rest of your app. Only the intent is really stored in the template now. This basically just acts as a key or a hint to developers and translators about what will appear in that place. So it's just an easy way for them to tell. So that's one parallel benefit of internationalization. And you can actually benefit from it even if you're only supporting one language. So another potential benefit is a simpler UI. Localization increases the effective cost of every single word on your page. So it really encourages you to use fewer words. So that can force you to rely on symbols more, which can be much more intuitive for a user. Like a well-placed, widely understood symbol. It's often just a much better experience. So regardless of how many languages you support, it can end up resulting in a better UI. But you should be careful when you're relying more heavily on symbols like that if you're building an international user base. Like, for example, when Apple introduced their first trashcan icon, Europe thought that was a mailbox. So that's a pretty bad mix-up. You just want to be careful of things like that. Okay. So another benefit is it encourages you to ditch images with text in favor of semantic markup and CSS. It's a lot more work when your images have locale-specific content. So you're definitely going to want to do something like that second example instead. And this has a bunch of other benefits, like better accessibility and more flexibility, all sorts of stuff. So speaking of flexibility, localization will also encourage you to keep your page layout really flexible. So when you're only supporting one language, it's really easy to overlook when your layout is really inflexible and only supports your current content, right? 
But when you're supporting multiple languages, the size and shape of your text can change dramatically. So you end up designing your layout to be a lot more flexible and handle different content gracefully. And again, that makes everything better in every language. So it'll also make sure you're being really careful with your encodings everywhere that you store or display text. This is just, I mean, obviously, right? General best practices here. But you can kind of overlook it if you're only supporting, like, English. So just definitely always have a content type and a meta tag and always be really careful about your database encodings, and you should be fine. But if you don't do that stuff, you're going to end up with something like this, which is called mojibake. It's basically the unreadable characters that result from using the wrong encodings. And I really love this word. It's an awesome word, but it means a terrible thing. So just don't do that. Be careful. Okay. So once you've decided it's time to expand your linguistic offerings, you need to figure out which languages you'd like to support. So you probably have some kind of analytics tool that will tell you where your users live and what languages they speak. If not, Google Analytics is a really easy way to gather this information. Or you can probably just rely on standard HTTP headers if you want, like Accept-Language and User-Agent. So once you have a potential list of target languages, it's time to evaluate the relative difficulty of supporting them. And to do that, we get to talk about one of my favorite topics, which is linguistics. I happen to have studied it, so this whole thing is actually just an excuse for me to nerd out about linguistics. Okay. So it's easy to code yourself into a corner if you don't know the scope of your problem. So let's just do a quick survey of some curveballs that languages might throw you. 
So a morpheme is the smallest chunk of a language that carries some kind of meaning. And this is the part of the presentation where I start to look like a crazy cat lady, but in my defense, I was cat sitting when I made these slides. I just, you know, cats on the brain. Lots of kittens coming up. So the word cat contains one morpheme and the word cats contains two. One meaning the thing that purrs, and one meaning not one of them. So a typical English phrase has slightly more morphemes than words. And for now, we're going to define a word as a group of sounds that's conventionally surrounded by spaces when we write it. But what really constitutes a word is a whole talk of its own, and I just don't want to get into it. So let's just leave it at that for now. So let's take the phrase, I saw the cats running away. We can break this down into nine morphemes. I see past the cat plural run progressive away, right? So six words, nine morphemes. And that's a typical word to morpheme ratio in English. So we can calculate the same ratio for any language actually, and use that ratio to place it on a spectrum accordingly. So on one end of the spectrum, we have these analytic languages like Chinese and Vietnamese. And in these languages, they typically have about a one-to-one word to morpheme ratio. So everything's its own word, and they don't use a lot of affixes. And they indicate meaning by word order rather than by changes to the words themselves. As you move across the spectrum toward languages like Latin and Greek, which are known as synthetic languages, the ratio changes. So a word in one of these languages is a whole phrase in other languages. So one of the most synthetic languages is Inuktitut, which is an Inuit language. If you compare Chinese and Inuktitut, you can get a really good sense for the bounds of this spectrum. So take the phrase, if you wait for me, I will go with you. In Chinese, this phrase is nine words and nine morphemes. So the ratio is one-to-one. 
But in Inuktitut, this phrase is actually just two words. So the ratio is actually about two-to-nine here. So you can see how differently those work. So why do we care about this ratio? Well, basically it's a really good way to predict how much agreement a language exhibits. So agreement is where a characteristic of one word in a phrase will affect the form or inflection of other words in the phrase. So let's take this example in Swahili. So this means one book will be enough. But let's say we changed our minds and now we want two books. So in order to do that, we're going to have to change every other word in the phrase to match the new number. So you can see that in English there's also some agreement here. Book changes to books, but it's only one word because English is somewhere in the middle of that spectrum. In Swahili, you basically have to change every word. That was a fun year of my life in college, Swahili. Okay, so let's see. Compare Swahili and English to Japanese, which exhibits no agreement at all. And this slide makes me wish I had studied Japanese instead of Swahili. Because look at this. That's so cool. One, two, three. Not bad. Good job, Japan. Keep it up. So that's what the spectrum is all about. To put it in terms you might be more familiar with, and I don't know if this is true, but agreement is kind of like a measure of cyclomatic complexity. So if you change a word in one part of the phrase, it can have all these unintended side effects in all sorts of other parts of the phrase. I'm sorry if that hurt more than it helped, but it's true. I could explain more later. So when you change something in one place, yeah, it has all these other side effects. And it has a lot of redundancy built in. So agreement is very WET, the opposite of DRY. Because you get the same things stored all over the place, like in that Swahili example. So lots and lots of things can trigger agreement. Person, gender, case, voice, aspect, tense, mood. The list goes on and on and on. 
So we don't need to worry about most of those, luckily. Because when you're translating software, usually most of those don't come up. So let's just talk about the ones that come up the most. And those happen to be person, gender, and number. So in Indo-European languages, person agreement is really not too bad. We usually have just these six person categories. You've probably all seen this chart. First, second, third, singular, plural. So verbs only agree with the subject in Indo-European languages. And this is a luxury that we really take for granted. Because it's not the case in every language. For example, languages like Georgian have something called polypersonal agreement. Which means the verb agrees not just with the subject, but also with, like, God knows what else in the sentence, right? Like in Hungarian, for example, and this is a Hungarian example that my Hungarian co-worker was nice enough to confirm, like, no, this is how great the language is. So the verb agrees with both the person and number of the subject. And also the specificity of the object if there is one. Just grok that. Do you guys speak Hungarian? I respect you. Specificity. Like definite versus indefinite. Like "a" versus "the", kind of. So if you're talking about a specific cat, you have to conjugate this verb one way. If you're talking about a cat that I have not necessarily specified, it's like a whole different word. So that's person agreement. Now, number. You're all familiar with the English system, which has just two grammatical numbers. There's one and not one. And so here's how the Common Locale Data Repository, the CLDR, represents those rules. It's really not too bad. We have it easy. So, yeah, if you were to do this in Rails, it's not too bad. You just have to provide, you know, a value for one and a value for not one. But in some languages, like Arabic, it's a little trickier. So this is how the Common Locale Data Repository represents Arabic pluralization rules. 
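The slides themselves are not in the transcript, so here is a minimal plain-Ruby sketch of the two rule sets, following the CLDR cardinal plural categories for non-negative integers. The method names `plural_category` and `pluralize_entry` are hypothetical, chosen for illustration.

```ruby
# Map a count to a CLDR plural category for a given locale.
# Only "en" and "ar" are covered; any other locale falls back to :other.
def plural_category(locale, n)
  case locale
  when "en"
    # English: one, and not one.
    n == 1 ? :one : :other
  when "ar"
    # Arabic: six categories, partly based on the last two digits.
    mod100 = n % 100
    if n == 0 then :zero
    elsif n == 1 then :one
    elsif n == 2 then :two
    elsif (3..10).cover?(mod100) then :few
    elsif (11..99).cover?(mod100) then :many
    else :other
    end
  else
    :other
  end
end

# Pick the stored form for the count's category, falling back to :other,
# then interpolate the count into it.
def pluralize_entry(locale, forms, count)
  template = forms.fetch(plural_category(locale, count)) { forms.fetch(:other) }
  format(template, count: count)
end

english = { one: "%{count} message", other: "%{count} messages" }
puts pluralize_entry("en", english, 1)
puts pluralize_entry("en", english, 5)
```

Storing one string per category, rather than hard-coding a singular and a plural, is what lets the same lookup serve both a two-category and a six-category language.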
And you just need to know that your app is going to have to be flexible enough to handle rules like this as well. The Russian one is equally complex and scary. Maybe more. All right. So gender. Gender is another common agreement category that you may have encountered. You've probably encountered it. So in other languages, grammatical gender is usually limited to either male female, male female neuter or male. So it's grammatically simpler than a lot of types of agreement. But that doesn't mean it's an easy problem. It's really maybe an even harder problem. Because sex and gender are just really charged issues. So you might remember when, in 2008, Facebook suddenly started aggressively asking users to specify their sex. So before then, they'd been displaying this text. You know, user tagged themselves in a photo. And the user would say, this is a stupid thing to get worked up about. So they ignored it. But then they started to internationalize. And they realized that that solution wasn't going to work as neatly in other languages. In fact, in some languages, it just doesn't work at all. There's no equivalent of this solution. They knew they were going to need to display all sorts of flexible stories of information. So they started to ask for the missing information in the name of grammar. So you might end up having to do the same. And I just wanted you to be aware of the issue and try not to be overly heteronormative about it if you end up having to ask for that information. So now that you have a good idea of your target languages and the challenges they're going to give you, you're going to have to figure out a way to implement internationalization. So whatever frameworks you choose, your actual translations are going to be either machine generated, crowdsourced, or professionally translated. So depending on your implementation, you might be able to mix and match between these options. 
So first, let's talk about machine translations. So if you decide real internationalization doesn't make sense yet, because you're constrained on time or money or whatever reason, Google and a few others offer some client-side translation tools. They're not real internationalization, but they will provide your site in other languages and they will require almost nothing from you. So basically it just adds this nice selector to your site and it lets you pick between 65 supported languages. This is a really impressive list of languages for machine translation. So that's what you get out of the box when you add this to your site. So from a user's perspective they have access to both machine translations and the original text. And if they see a bad translation, like this one, they can easily suggest a better one with a simple inline form. So the site administrator can also maintain a list of language editors who have access to these tools where they can approve and reject submitted translations. And you can also define a glossary of special terminology, like your product name or other special terms you have. So you get all of those admin tools there for free. Okay, so if you choose this plugin, storage is totally out of your hands, obviously, it's all in the cloud. Selection happens through the drop-down language selector that I showed you. Transformation happens when users click that contribute button and contribute corrections or when your admins define that glossary, etc. And then presentation is just, you know, it's really simple to implement this. So this tool is free and super easy, but unfortunately there are major, major downsides. So the page jumps from your primary language to your target language on every request, which is a really bad user experience. And it's not very professional looking, obviously. And you have no ownership. 
You're completely relying on the continued goodwill of a third party, which, if you've ever relied on a Google API for a long time, and I have, you know that you probably shouldn't build a whole app around, because they will deprecate it at some point. So on top of that, you can't provide multilingual site search internally or externally. There's not much option there. So because of these downsides, at some point you're probably going to want to store your own translations. So once we have some kind of storage in place, next we'll have to figure out selection. So how should your app decide which content to pull for a particular request? We don't want to rely only on the Accept-Language header, for a bunch of reasons. For one, bots don't send this header, so they can't index any content other than your default. So you should figure out something else. So what else can we do? Well, one easy option is to establish a locale pattern in your URL, whether that's through top-level domain, subdomain, or some permanent part in your path. So if you do that, locale detection will be a really simple matter. You just take whatever's in the URL or you redirect there based on Accept-Language. And then you can let the user set a preference from there. So once you have locale decided, all you have to do is pull that corresponding data from storage. So much like in our earlier example, transformation and presentation are a really simple matter of just applying the rules that you've stored for a particular language to the data in that request. And then rendering that for user consumption is super easy. So it looked like earlier a lot of you are involved in Rails, so that's good news because you're probably somewhat internationalized already. So Rails ships with this really extensible internationalization gem called I18n, by Sven Fuchs, whose name you're going to see a lot. 
I'm not stalking him, but he wrote everything related to this topic ever. He's just the king of Rails internationalization. So he wrote this nice gem that gives you these basic methods like translate, localize, and transliterate and stuff like that. So what it comes with is a framework for the API and then a simple backend to get you started. So the backend by default stores your translations in these YAML files. And for example, given this in YAML, you could do this in the console or wherever. So it produces the right translations with the appropriate punctuation and translations and all that for French. So if your localization needs are simple, you can kind of limp by with this YAML storage for a while, but I'm going to just tell you these totally suck to create and maintain and just keep synced up between languages, and you're going to do it for five minutes and hate it, so just don't even bother. So luckily, it's really easy to store these translations in different ways. Let's see. So yeah, any persistent key-value store, like Redis, or whatever database you want. You can chain multiple backends however you want. Or if you have a ton of static content, which probably doesn't apply to a lot of you, but it's an option, you can just include the locale in your template file name. But because of what I talked about earlier, it's probably not a good idea to keep that logic all in your templates. So if you prefer to store your translations in a database instead of YAML or template files, you have a ton of options. The simplest backend is the i18n-active_record gem, also by Sven Fuchs, which stores translations in the database with Active Record. So obviously these need to be heavily, heavily cached for performance reasons. You pull tons of translations in every request, so duh. But the gem makes it really easy to set up. So this gem is super extensible. 
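To make the idea of chaining backends concrete, here is a plain-Ruby sketch of a lookup that tries several stores in order. The class names `HashBackend` and `ChainBackend` are hypothetical; this illustrates the concept and is not the i18n gem's own implementation or API.

```ruby
# A backend that answers lookups from an in-memory hash.
# It stands in for any store: YAML files, a key-value store, a database.
class HashBackend
  def initialize(data)
    @data = data
  end

  # Returns the stored string, or nil when this backend has no entry.
  def lookup(locale, key)
    @data.dig(locale, key)
  end
end

# A backend that delegates to other backends, first hit wins.
class ChainBackend
  def initialize(*backends)
    @backends = backends
  end

  # Ask each backend in order; return nil if none of them has the key.
  def lookup(locale, key)
    @backends.each do |backend|
      value = backend.lookup(locale, key)
      return value unless value.nil?
    end
    nil
  end
end

database_like = HashBackend.new("fr" => { "greeting" => "Bonjour" })
yaml_like     = HashBackend.new("fr" => { "greeting" => "Salut", "farewell" => "Au revoir" })
chain         = ChainBackend.new(database_like, yaml_like)

puts chain.lookup("fr", "greeting")  # found in the first backend
puts chain.lookup("fr", "farewell")  # falls through to the second
```

Ordering the chain so the editable store comes first and the static files come last lets new translations override the defaults without deleting them.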
Here's an example of how you could include the ActiveRecord Missing module, which basically lazily populates your database with missing translation keys whenever it finds them. So that gives you lots of options for filling those in later on the back end somehow. And it lets it fall back to using the simple backend. So the previously discussed gems are really useful for translating snippets of UI text or anything else that would otherwise be stored in the template. But if you're storing text in the database for like a blog post or something, there's this gem called Globalize3, also by Sven Fuchs, which is really great. It works well with any of the other options and all it does is scope Active Record getters and setters to the current locale. So it's very easy to use. There are a lot of ways to extend this default backend or the API however you want. So this i18n-inflector gem is a good example of someone doing that. So this gem basically provides an additional layer of abstraction on top of the default translation backend for dealing with highly, highly inflected languages. Like for Slavic languages, this totally makes sense, but in English it's kind of ridiculous. So the linguistics nerd in me is a huge fan of this approach because it's really academically faithful. But in practice it's just not a practical solution, because imagine reading a template that contained this without the comment. It's really hard to read, so I'm probably not going to use it. But it is an example of how you can extend the default options. So Yammer uses this open source gem called Tr8n to crowdsource translations. "Tron". That is how it's pronounced, I'm sure of that. So it's easy to understand when you see it. So let's just walk through a typical Tron translation scenario. So let's say you just joined Yammer and you're having trouble reading the default English copy because your default language is Pirate. I don't know. 
So that happens to be one of the more popular user-generated languages on Yammer. I don't understand, but we all speak Pirate, so it's a good example. So you click on the current language in the footer and this brings up a language selector lightbox. So then you see the language and you click it and suddenly everything is in that language. Again, I did not add any of this. Like, real Yammer users spent their time on this. For better or worse. But then one day you notice this untranslated phrase and you think someone should really fix that. So you go back down to the footer and you click start translating this time. So then you can see red and green underlines everywhere, which indicate what's been translated and what still needs to be. So you go back to the phrase and click on it and this brings up a special form. So this one doesn't have any tokens. It's pretty easy. I just put in my translation and I can immediately see the results of my actions. Pretty cool. So if I see a translation that I really like or dislike, I can vote on it or submit my own. Again, real users voting on Pirate translations. So we at Yammer can monitor these translations and translators with this default admin panel. It's a really rich, fleshed-out admin panel that the gem provides out of the box. So it's pretty easy for us too. So now let's look at a case where we have to use context rules. I notice that this translation is ungrammatical and I want to fix it. So I right click it and I start entering my own translation. But this time I'm going to click generate context rules, which brings up a different form. So now I have to tell Tron that the translation depends on the numerical value of count. So that brings me to a different form. Tron is really easy to populate with locale-specific rules. So it already knows that in Pirate English, much like other dialects of English, there are two types of number, one and not one. 
So it generates a nice form for those and lets me fill them in. So I fill them in and the translation is much, much better. So on the back end all the translations and rules are stored in a really normalized way in the database. So it's pretty easy for us to navigate through. So we at Yammer love Tron. Since we started using it, we've collected about 70,000 translations for about 80 languages from our users. All by people who just wanted to use the site in their native language or just offer their expertise. So there are a lot of advantages to collecting translations this way. For one, obviously, it's free, and that's not a bad price to pay for translations. Second, it allows for a kind of progressive enhancement. So you don't have to wait until every single word of your site is translated into a language to offer some benefit to your users. Even if you're at 50%, that's often a much better experience for users. And then if they say, hey, what about the other 50%, you can be like, well, you speak it, translate it. So it's pretty nice. You also don't have to wait for translations to come back if you want to change or add text. You can just deploy that immediately, and if it's a really popular language, someone will just fill that in right away. So third, it's a totally self-monitoring system. So it requires very little oversight from us. So this is how babysitting works? I assume. So it's kind of like babysitting Tron. It's really easy. So one of Tron's big advantages for developers is that you can actually use the default language's value as the key. So here the value is "Hello {name}" and we're also using it as the key, and then we have the second argument, "Welcome message", which is a way for us to communicate the intent of that key. So this keeps code really readable while also providing a really important way to disambiguate keys for translators. 
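A small plain-Ruby sketch of that idea: the default-language text plus a description together form the lookup key. The `tr_sketch` helper, the `LABEL_STORE` layout, and the Spanish strings are hypothetical and only illustrate the concept; they are not the gem's API.

```ruby
# Translations keyed by [default-language label, description].
# Two entries share the label "Invite" and differ only by description.
LABEL_STORE = {
  ["Hello {name}", "Welcome message"] => { "es" => "Hola {name}" },
  ["Invite", "Verb: button that sends an invitation"] => { "es" => "Invitar" },
  ["Invite", "Noun: heading above a received invitation"] => { "es" => "Invitación" }
}.freeze

# Look up a translation by label and description, then fill in the tokens.
# When nothing is stored for the locale, the default-language label is used.
def tr_sketch(label, description, locale, tokens = {})
  translated = LABEL_STORE.dig([label, description], locale) || label
  tokens.reduce(translated) do |text, (name, value)|
    text.gsub("{#{name}}", value.to_s)
  end
end

puts tr_sketch("Hello {name}", "Welcome message", "es", name: "Ana")
puts tr_sketch("Invite", "Verb: button that sends an invitation", "es")
```

Because the label doubles as the fallback, an untranslated string still renders readable default-language text, which fits the progressive-enhancement point made above.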
So sometimes, like with "invite" the verb versus "invite" the noun in English, if that's your only word, you can't really use that key unless you have a way to communicate something else about it. So this lets us disambiguate. So crowdsourcing can be a great way to localize, but sometimes it makes sense to use professional translators. You might need better quality or more coverage than crowdsourcing can provide. Or maybe you want to carefully translate some really prominent landing page text or something, lock it down. So if this is the case, your main obstacle is going to be cutting out all of this bullshit that gets between you and having translations. So there are inevitable delays with introducing a relatively slow third-party development step like this. You just want to make everything as automatic as possible. But luckily, there are some services out there with great APIs that basically just let your app talk directly to translators. And this can reduce the turnaround time to just a few hours. So it's a really great way to do that. So you might be familiar with crowdsourcing options like Mechanical Turk or CrowdFlower. So it turns out there are translation-specific versions of that. And the one we're going to look at today is called Gengo. Pretty good. The best part about Gengo is they have this simple REST API and an open-source Ruby client. So this gives you a ton of flexibility for your deployment flow. It doesn't really take too much imagination to see how. So let's say you're using something like ActiveRecord Missing or really any backend that lazily populates the database with missing translation keys. So all you have to do is write, say, a simple rake task that just goes through your database, collects all the untranslated keys, and then posts them to Gengo's jobs endpoint through that open-source Ruby client that I mentioned. 
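The collection step of such a task might look roughly like the following plain-Ruby sketch. The record layout and the `submit_job` callable are hypothetical stand-ins: the real service client and its method names are not shown in the talk, so nothing here should be read as that client's API.

```ruby
# Find keys that have been recorded for a locale but have no translation yet.
# Each record is assumed to be a hash with :locale, :key, and :value.
def untranslated_keys(records, locale)
  records
    .select { |record| record[:locale] == locale && record[:value].nil? }
    .map { |record| record[:key] }
end

# Submit one job per untranslated key through an injected callable,
# and return whatever the callable returns for each submission.
def request_translations(records, locale, submit_job)
  untranslated_keys(records, locale).map do |key|
    submit_job.call({ key: key, target_locale: locale })
  end
end

records = [
  { locale: "fr", key: "greeting", value: "Bonjour" },
  { locale: "fr", key: "farewell", value: nil },
  { locale: "de", key: "greeting", value: nil }
]

# A stand-in submitter that just prints; a real task would call the service client here.
printer = ->(job) { puts "would submit #{job[:key]} for #{job[:target_locale]}"; job[:key] }
request_translations(records, "fr", printer)
```

Injecting the submitter keeps the collection logic testable without network access, and makes it easy to swap in a different translation service later.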
So when the job is done, Gengo will automatically ping your app through a callback URL, and then you can just automatically fetch and store those translations. And you really had to do nothing. It just filled in the missing translations and you paid them some amount of money. That's between 5 and 15 cents a word, depending on the quality you request. So if you wanted to automate this even further, you could imagine kicking the task off from something like a Git hook or a Capistrano recipe. You can, you know, use your imaginations. It can be good. So I hope this gives you a better sense for your options if you do decide to localize. And I hope you're not too scared about localization because it's an important and, I think, fun process. Thank you very much, and if you... APPLAUSE