Translation for package developers

In this vignette you’ll learn how to set up your package for translation, focusing on translating messages in your R code. It’s aimed at a package developers; if you’re translating an existing package, you might want to start with vignette("translators").

library(potools)

Basic process

Before we get into the details lets review the basic process:

Extraction

potools provides two styles for extracting messages for translation: base and explicit, as described below. Once you’ve decided which style you want to use, record it in the DESCRIPTION:

Config/potools/style: explicit

And then run po_extract() to generate the .pot file.

Base style

The base style captures messages the base functions that include built-in translation capabilities1: message(), warning(), and stop(). It also captures messages from the explicit translation functions gettext(), gettextf(), and ngettext(). (It does not, however, not translate cat()). The advantage of the base style, is that it’s very quick to get started with.

message(), warning() and stop() concatenate the components of :

However, as you’ll learn shortly, this style is unlikely to generate messages that are easily translated, so po_extract(style = "base") will also capture messages from messagef(), warningf(), and stopf(). These are equivalents to message(), warnings(), and stop() that use sprintf() style (hence the f suffix).

These functions are not included in base R, so if you want to use them, you’ll need to copy the definitions from below:

Explicit style

The explicit style only captures messages that explicitly flagged for translation by gettext(), ngettext(), or tr_(). Like messagef() and friends, tr_() is not provided by R, so you’ll need to define it yourself:

The advantage of the explicit style is that it’s very clear which messages are ready for translated. The disadvantage is that it’s easy to miss string, so in the future we’ll provide automated ways to identify strings that haven’t been translated and probably should be.

I’ll use the explicit style in the rest of this vignette because it makes it very clear what is being translated.

Writing good messages

The mechanics of translating your package are quite straightforward. The bigger challenge is writing messages that are easy to translate. In part, this is an extension of writing messages that are easy to understand in English as well! And if it’s hard for a native English speaker to understand your message, it’s going to be even harder once it’s translated into another language. The following sections give some advice about how to write good messages, as inspired by the “Preparing translatable strings” section of the gettext2 manual.

Write full sentences

Generally, you should strive to make sure that each message comes from a single string (i.e. lives within a single "“). Take this simple greeting where I translate”good" and “morning” individually:

This will pose two challenges for translators:

Instead it’s better to generate the complete message in a single string using glue() or sprintf() 3 to interpolate in the parts that vary:

Then the translator sees something like this:

msgid "Good morning {name}!"
msgstr ""

This gives the translator enough context to create a good translation and the freedom to change word order to make a grammatically correct sentence in their language. We can make the problem more challenging by making our greeting more flexible:

This would generate the following sequence of translations for French:

msgid "Good"
msgstr "Bon"

msgid "morning"
msgstr "matin"

msgid "afternoon"
msgstr "après midi"

msgid "evening"
msgstr "soirée"

Unfortunately this breakdown won’t generate correct French. The three greetings should be “Bonjour” for morning and afternoon, and “Bonsoir” for evening. There are two problems: good morning and good afternoon both use bonjour (even though French has different words for morning and afternoon; bon après-midi is used as a farewell), and the two word English phrases turn into single French words.

If you were translating to Mongolian you’d face a different problem. While Mongolian uses the same times of day, it arranges the words in the opposite order to English: “Өглөөний мэнд” is morning greetings, “Өдрийн мэнд” is afternoon greetings, and “Оройн мэнд” is evening greetings.

Again, we need to resolve this problem by moving away from translating fragments and towards translating complete sentences. One way to do that here would be to restrict ourselves to a fixed set of time points and use switch() to specify the greeting:

This works for French (and Mongolian):

msgstr: "Good morning {name}!"
msgid: "Bonjour {name}!"

msgstr: "Good afternoon {name}!"
msgid: "Bonjour {name}!"

msgstr: "Good evening {name}!"
msgid: "Bonsoir {name}!"

However, it’s still not a fully general solution as it assumes that the time of day is the most important characteristic of a greeting, and that the day is broken down into at most three components. Neither is true in general:

Greetings are particularly challenging to translate because of their great cultural variation; fortunately most messages in R packages won’t require such nuance.

sprintf() vs glue()

In R, there are two common ways to interpolate variables into a string: sprintf() and glue(). There are pros and cons to each:

The difference may be more important than you realize – as mentioned above, some languages (e.g., Turkish, Korean, and Japanese) assemble phrases into sentences in a different order. “I have 7 apples” becomes “7りんごをもっています” in Japanese, i.e. “7 apples [I’m] holding” – the verb & subject switched places. The reordering of templates in your messages is going to be quite common if you want your messages available in more than a very limited set of languages.

Un-translatable content

You can use interpolation to avoid including un-translatable components like URLs or email addresses into a message. This is good practice because it saves work for the translators, makes it easier for them to see changes to the text, and avoids the chance of a translator accidentally introducing a typo. It works something like this:

Similarly, if you’re generating strings that include in HTML, avoid including the HTML in the translated string, and instead translate just the words:

Generally, you want to help the translator spend as much time as possible helping you out.

Googling

It’s worth noting that that non-English messages are often harder to Google because few non-English languages have a significant presence on StackOverflow. A few suggestions:

Plurals

In English, most nouns have different forms for one item (the singular) or more than one item (the plural, also used for zero items). Or, more formally, in English the grammatical count has two forms: singular and plural. So you might be tempted to construct a sentence like this:

But this doesn’t always work, even in English:

Again, we always want to construct a complete sentence:

But there’s an additional wrinkle here: while English has singular and plural, other languages have different forms like singular (1), dual (2), and plural (3 or more), or singular (1), paucal (a few), and plural (many). So we need to use a different helper: ngettext(n, singular, plural):

ngettext() generates a special form in the .po file, which shows both singular and plural forms:

msgid "There is {n} cow in the field"
msgid_plural "There are {n} cows in the field"

Then the translator can supply any number of translations. Languages that don’t have plurals (e.g. Chinese) only need to supply a single translation:

msgid "There is {n} cow in the field"
msgid_plural "There are {n} cows in the field"
msgstr[0] "田裡有{n}頭牛"

Russian has three forms, so it gets three entries (roughly 1, 2-4, and everything else):

msgid "There is {n} cow in the field"
msgid_plural "There are {n} cows in the field"
msgstr[0] "В поле {xn} корова"
msgstr[1] "В поле {n} коровы"
msgstr[2] "В поле {n} коров"

Slovenian and Serbian have four forms, Irish has five forms (learn more at bitesize Irish), and Arabic has six forms. The rules that define which number get which message are quite complex and encoded in the “plural form” that’s recorded at the top of the .po file and looks something like (n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20)? 1 : 2). However, this is something the translators need to worry about not you, and as speakers of the language they should find it easier to puzzle out the rules.

Collapsed lists

In English we typically lists of items like “a, b, or c”, where the use of the serial, or Oxford, comma being a hotly debated style preference. European languages follow the mostly same form, although none use the Oxford comma, and they obviously translate “or”4. The and package takes care of these details:

Which also works will in glue:


  1. You can tell they have translation capabilities because they include the domain argument; behind the scenes they all call gettext().

  2. gettext is the underlying library that powers R’s translation system.

  3. If you’re using the “base” style, you could instead write gettextf("Good morning %s", name); gettextf() is a version of sprintf() that translates the first argument.

  4. In Spanish and Italian, the word used varies based on the start of the following word.

  翻译: