Language support

The DocBook XSL stylesheets support documents written in many languages. This support is made easier by the fact that XML itself supports Unicode, which includes characters for most of the world's languages. To write a DocBook document in a given language, identify a character encoding that can express the language, and then declare that encoding in the XML declaration at the top of each XML file, such as <?xml version="1.0" encoding="iso-8859-1"?>. The declaration is required whenever the file uses an encoding other than UTF-8 or UTF-16. Write the text of your document in that character encoding, and use the standard DocBook tags (which have English names) to mark the XML elements. Finally, make sure the XSLT processor you use supports your encoding.

The language support in the DocBook XSL stylesheets is primarily for generated text that the stylesheets produce. For example, an English document should label a chapter with Chapter 3, while a German document's chapter should be labeled Kapitel 3.
The XML document encoding does not tell the stylesheets what language the document is written in. You have to supply that information with either a lang attribute in the document or a stylesheet parameter at processing time.
Indexing in DocBook XSL does not sort properly for non-English languages. But there is a customization available that does sort properly. See the section “Internationalized indexes”.

Using the lang attribute

The preferred method of indicating language is by adding a lang attribute with a language code value, usually on the document root element. This method records the language within the document itself, so it is clear to anyone examining the document. Also, the attribute triggers automatic processing in that language by the stylesheets. That means you do not have to indicate the language on the processing command line.

Since lang is one of the common DocBook attributes, it is permissible for all DocBook elements. The attribute applies to the element it is in, and all of that element's descendants. If one of the descendants has a different lang attribute, then it overrides the ancestor's value for the scope of that descendant. For example, if a document's root element is book, you can put a lang attribute in the book start tag so it applies to the whole document. If one of your chapters is written in a different language, then it can have a lang attribute whose value applies only to that chapter. The following example illustrates this usage.

<book  lang="de">
  ...
  <chapter>
    <title>Profil verwalten</title>
    ...
  </chapter>
  <chapter  lang="en">
    <title>Special Features</title>
    ...
  </chapter>
  <chapter>
    <title>Junk-E-Mails vermeiden</title>
    ...
  </chapter>
</book>

In this example, the document root element sets the lang to de (German) for the document. So the chapters Profil verwalten and Junk-E-Mails vermeiden are processed as German. But the Special Features chapter has its own lang set to en (English). So the second chapter is processed as English. Its label will be Chapter in the chapter title page, the book's table of contents and any cross references to that chapter.

Using language parameters

You can also indicate the language of a document at processing time by using a stylesheet parameter set to a language code. This is useful if you are processing a document that does not have a lang attribute and you cannot edit it to add one, or if you want to override the attribute it does have. There are two stylesheet parameters that can be used to set the processing language:

The parameter l10n.gentext.language will override any lang attribute set in the document. This parameter is only needed if the document is in a single language that is not English, and one of the following conditions applies:
- It does not have a lang attribute.
- The lang attribute it does have is wrong.
- The lang attribute it does have is not one of those supported by the stylesheets.
The parameter l10n.gentext.default.language can be used in the same circumstances as the previous parameter, but it will not override any lang attributes in the document. It will apply only to those elements for which no lang attribute applies. Thus if there is a lang attribute on the document's root element, then the parameter will have no effect.
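
As a hedged illustration, either parameter could be set in a stylesheet customization layer such as the following. This is a minimal sketch: the import URL is the conventional canonical address of the stock HTML stylesheet and is an assumption here (substitute the path to your own copy), and de is only an example language code.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Assumed location of the stock HTML stylesheet; adjust to your setup -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl"/>

  <!-- Applies only to elements with no lang attribute in scope -->
  <xsl:param name="l10n.gentext.default.language">de</xsl:param>

  <!-- Alternative: uncomment to override every lang attribute in the document -->
  <!-- <xsl:param name="l10n.gentext.language">de</xsl:param> -->

</xsl:stylesheet>
```

The same parameters can instead be passed on the processing command line; the exact syntax depends on the XSLT processor you use.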

If you are wondering about the names of these parameters, you probably do not recognize the odd abbreviation l10n, which is a lower case L followed by the number 10 and the letter n. This is an abbreviation of “localization” (the first and last letters, and 10 letters in between). It means the gentext strings are adapted to a particular locale in the world. This abbreviation is similar to i18n, which is an abbreviation for “internationalization”.

Language codes

As of this writing, DocBook XSL supports 45 languages. That means it has translations for the generated text strings in 45 languages. The translations are stored in XML files named for the language code, such as en.xml, fr.xml, etc. These are stored in the common subdirectory of the stylesheet distribution. So if you want to check if a given language is supported, look in that directory for an XML file of that name. The top of each file looks like the following:

<?xml version="1.0" encoding="US-ASCII"?>
<l:l10n xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0" 
        language="it"
        english-language-name="Italian">

The language attribute identifies the language code. It is this attribute value that the stylesheet uses to match to a lang attribute in a document. The filename just happens to have the same name. The english-language-name attribute gives the language name in English for each language.

Most of the language codes are two letters long, taken from the ISO 639 standard. A few have variations to reflect how a given language is used in a different country. For example, the pt_br code is for Portuguese as spoken in Brazil. The country codes that are used in the second part of the name are listed in the ISO 3166 alpha-2 standard.

When you specify a language code for your document in an attribute or parameter, you can use upper- or lower-case letters. If it has a country extension, you can use either dash or underscore as the separator. In all these cases the stylesheets will map the code to the supported value.

If you specify a country extension, and there is no translation for that extension, the stylesheet will fall back to using just the two-letter language code. If a two-letter code is not supported, then the stylesheets fall back to English.
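
The mapping and fallback rules above can be sketched with a small test document. The country extension XX and the language code xx are made up for illustration, and the comments describe the behavior stated in this section rather than verified output.

```xml
<book lang="pt-BR">        <!-- dash and upper case: mapped to the supported code pt_br -->
  <title>Language code examples</title>
  <chapter lang="PT_br">   <!-- mixed case with underscore: also mapped to pt_br -->
    <title>First chapter</title>
    <para>Sample text.</para>
  </chapter>
  <chapter lang="de-XX">   <!-- made-up country extension: falls back to de -->
    <title>Second chapter</title>
    <para>Sample text.</para>
  </chapter>
  <chapter lang="xx">      <!-- made-up code with no translation file: falls back to English -->
    <title>Third chapter</title>
    <para>Sample text.</para>
  </chapter>
</book>
```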

Extending the set of languages

In theory, DocBook XSL can support any language that can be expressed in Unicode. In practice, only 45 languages have translated text strings that the stylesheets can access. If you need a language that is not currently available, then you can make the translations and add them to your stylesheets. You should copy the English file common/en.xml to a new language code XML file, and then translate the text attributes in the file. The translations should use Unicode numerical character references for any non-ASCII characters.
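
A new language file might begin like the following sketch. It assumes a hypothetical fy.xml for Frisian, and it assumes that the gentext keys and the title template shown here match the entries in your copy of en.xml, so check them against that file. The text values are still in English, as they would be immediately after copying.

```xml
<?xml version="1.0" encoding="US-ASCII"?>
<l:l10n xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0"
        language="fy"
        english-language-name="Frisian">

  <!-- Copied from en.xml: replace each text value with its translation -->
  <l:gentext key="Chapter" text="Chapter"/>
  <l:gentext key="TableofContents" text="Table of Contents"/>

  <!-- Enter non-ASCII letters as numeric character references,
       for example &#233; for e-acute (U+00E9) -->

  <l:context name="title">
    <!-- %n stands for the generated number and %t for the title text -->
    <l:template name="chapter" text="Chapter %n. %t"/>
  </l:context>

</l:l10n>
```

Only the attribute values change during translation. The element names, key names, and placeholders such as %n and %t stay as they are.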

The easiest way to add a new language to the stylesheets is to submit your translation to the DocBook XSL project for integration into the next release. Send email to the project admins at the DocBook SourceForge site. Then your new translation will be included in future stylesheet distributions. It also makes it available to other users, who can make contributions to it as well.

If you want to include your translation only in your own stylesheet, you need to do the following:

1. Copy the stylesheet file common/l10n.xml to a new filename, such as common/my-l10n.xml. It is best to keep it in the same directory because it references all the other language files in that directory.
2. Edit your new file to add a SYSTEM entity declaration to the DOCTYPE and an entity reference to the body of your copied file. Just copy similar lines from the file itself. The entity declaration should point to your new language file location, relative to the common directory.
```
<!ENTITY fy SYSTEM "../mystuff/fy.xml">
...
&fy;
```
3. Create a stylesheet customization layer if you do not already have one.
4. Add the following line to your customization file:
```
<xsl:param name="l10n.xml"
     select="document('../mystuff/my-l10n.xml')"/>
```
The path to your enhanced my-l10n.xml file should be relative to your stylesheet customization file.

The document() function loads your customized file into the stylesheet parameter l10n.xml. That parameter is searched when looking for a translation.
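
Put together, a complete customization layer for this purpose might look like the following sketch. The import URL is an assumption (use the path to your own stylesheet copy), and the ../mystuff/ path is the illustrative location used in the steps above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Assumed location of the stock HTML stylesheet; adjust to your setup -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl"/>

  <!-- Replace the stock translation set with the extended copy.
       The path is resolved relative to this customization file. -->
  <xsl:param name="l10n.xml"
       select="document('../mystuff/my-l10n.xml')"/>

</xsl:stylesheet>
```

Because the parameter is defined in the importing stylesheet, it takes precedence over the definition of the same name in the imported stock stylesheet.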

This arrangement is a bit awkward, and will need to be repeated with each new stylesheet release. It's best to complete the translation and submit it to the DocBook project.

Text direction

Some languages, such as Hebrew and Arabic, read from right to left. When viewing an XML source file, you might think that it reads from left to right, but that view is just an artifact of the viewing device. In fact, an XML file is a linear sequence of bytes, with no particular direction except from beginning to end. The file is in logical order, with the beginning of each word appearing earlier in the file than the end of the word, regardless of the language. Any device that interprets the bytes and assigns displayable characters has to choose how to lay out those characters in some readable fashion. For some languages, that presentation is left to right, and for others it is right to left.

There are two principal properties that determine the direction of text:

- Writing mode sets the overall direction for the document.
- dir attributes change the direction for specific spans of text.

Note that most right-to-left languages are actually bidirectional, because numbers still read from left to right, as do any words in the Latin alphabet, such as technical terms.

Writing mode

Writing mode is a term from XSL-FO that describes the overall plan for laying out text onto a page. A writing mode is a combination of horizontal direction and vertical direction for text flows. For example, an XSL-FO output with writing-mode="lr-tb" displays inline text that flows from left-to-right (lr), and lines that stack down the page from top-to-bottom (tb). Similarly, in rl-tb the inline text flows from right-to-left, and again the lines stack down the page from top-to-bottom.

If you are conditioned to Latin-based languages that read left-to-right, you may not realize how important the left side is for text layout. Indents that show hierarchy are indented from the left. Numbers in orderedlist and bullets in itemizedlist appear on the left. When outputting a right-to-left language such as Arabic or Hebrew, putting such features on the left does not work. The importance of these formatting features is not that they appear on the left, it is that they appear at the start of the line. The XSL-FO standard recognizes this, and uses the term start-indent instead of left-indent.

When writing-mode="rl-tb" (right-to-left), the start-indent property is applied to the right side. Similarly, bullets and numbers appear on the right, at the start of their line. Tables are also reversed, that is, the first table-cell in each row appears on the right.

You can set the writing mode for XSL-FO output by adding an attribute to the root.properties attribute-set:

<xsl:attribute-set name="root.properties">
  <xsl:attribute name="writing-mode">rl-tb</xsl:attribute>
</xsl:attribute-set>

When you set this property, you will find that your print pages are mirror images of the left-to-right writing mode. Even page headers and footers will be mirrored, because they use tables to lay out the different portions of the headers and footers, and the order of table cells is reversed. You may want to swap your values for the page.margin.inner and page.margin.outer parameters, because the side for binding would change.
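
A hedged sketch of an FO customization layer that combines the writing mode with swapped margins follows. The import URL is an assumption, and the margin values are illustrative: they assume an existing left-to-right setup of 1.25in inner and 0.75in outer that is being reversed.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Assumed location of the stock FO stylesheet; adjust to your setup -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

  <!-- Lay out inline text right to left, lines top to bottom -->
  <xsl:attribute-set name="root.properties">
    <xsl:attribute name="writing-mode">rl-tb</xsl:attribute>
  </xsl:attribute-set>

  <!-- Inner and outer margins matter only for double-sided output -->
  <xsl:param name="double.sided" select="1"/>

  <!-- Illustrative values, swapped relative to the assumed left-to-right setup -->
  <xsl:param name="page.margin.inner">0.75in</xsl:param>
  <xsl:param name="page.margin.outer">1.25in</xsl:param>

</xsl:stylesheet>
```

Whether the swap is needed depends on how your formatter treats page margins under a right-to-left writing mode, so check a test page before settling on values.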

Note

If you set writing-mode="rl-tb" in a document using a Latin-based language, the text does not print backwards. Only the layout is mirrored. As described in the next section, the text direction is based on the Unicode character range in use.

For HTML output, a right-to-left writing mode can be established by adding a dir="rtl" attribute to the HTML document element in the output. This currently requires using a customization, which differs if you are doing single-page or chunked output.

For single-page HTML, customize this template from docbook.xsl:
<xsl:template match="*" mode="process.root">
  <xsl:variable name="doc" select="self::*"/>

  <xsl:call-template name="user.preroot"/>
  <xsl:call-template name="root.messages"/>

  <html>
    <xsl:variable name="lang">
      <xsl:call-template name="l10n.language"/>
    </xsl:variable>

    <xsl:if test="starts-with($lang, 'he') or
                  starts-with($lang, 'ar')">
      <xsl:attribute name="dir">rtl</xsl:attribute>
    </xsl:if>
  ...

For chunked HTML, customize this template from chunk-common.xsl:
<xsl:template name="chunk-element-content">
  <xsl:param name="prev"/>
  <xsl:param name="next"/>
  <xsl:param name="nav.context"/>
  <xsl:param name="content">
    <xsl:apply-imports/>
  </xsl:param>

  <xsl:call-template name="user.preroot"/>

  <html>
    <xsl:variable name="lang">
      <xsl:call-template name="l10n.language"/>
    </xsl:variable>

    <xsl:if test="starts-with($lang, 'he') or
                  starts-with($lang, 'ar')">
      <xsl:attribute name="dir">rtl</xsl:attribute>
    </xsl:if>
  ...

Each of these customizations calls the utility template named l10n.language to get the language that applies to the current document. It then checks whether that value starts with either he (Hebrew) or ar (Arabic), and if so adds the dir attribute.
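
Since the same test appears in both customizations, one option is to move it into a named template of your own. The following is a sketch only: the template name html.dir.attribute is hypothetical and is not part of the DocBook XSL distribution.

```xml
<!-- Hypothetical helper template: emits dir="rtl" for Hebrew or Arabic documents -->
<xsl:template name="html.dir.attribute">
  <!-- Language in effect for the current context node -->
  <xsl:variable name="lang">
    <xsl:call-template name="l10n.language"/>
  </xsl:variable>

  <xsl:if test="starts-with($lang, 'he') or
                starts-with($lang, 'ar')">
    <xsl:attribute name="dir">rtl</xsl:attribute>
  </xsl:if>
</xsl:template>
```

Each customized template would then contain <xsl:call-template name="html.dir.attribute"/> as the first thing inside its html element, in place of the variable and test. The call must come before any child content is output, because XSLT does not allow an attribute to be added to an element after its children.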

dir attribute

When processing content for output, you will find that the inline text direction is mostly handled automatically. That is, if you process an XML document containing Arabic, the formatted output will present the Arabic words from right to left, and any English words from left to right.

How does the formatter know when to switch the direction of presentation? It knows by the range of Unicode characters used in each word. Part of the information in the Unicode standard is the text direction that each range of characters is expected to be presented in. Latin letters are to be presented left to right, and Hebrew characters from right to left. Modern browsers and XSL-FO processors use that information to decide the direction of presentation. In mixed language text, sometimes called bidirectional text, the direction can change in mid-sentence. When a formatter encounters a bit of text that should be displayed in the opposite direction, it has to read forward to find the end of such text, print it out character-by-character reading backwards from the end, and then resume normal layout of the text that follows.

There are some combinations of text that make this task harder for the formatter. Punctuation, parentheses, numbers mixed with letters, and other combinations may present ambiguous information to the formatter. In such cases, the author may need to provide some help to the formatter through the XML markup.

The DocBook schemas starting with version 4.3 have supported an attribute named dir on almost all elements. The dir attribute provides a hint to the formatter for which direction to display the text enclosed by the element with that attribute. There are four possible values:

dir attribute value	Unicode Name	Description
`ltr`	Left-to-Right Embedding	Embed a span of left-to-right characters inside right-to-left text.
`rtl`	Right-to-Left Embedding	Embed a span of right-to-left characters inside left-to-right text.
`lro`	Left-to-Right Override	Force the characters to be treated as strong left-to-right characters.
`rlo`	Right-to-Left Override	Force the characters to be treated as strong right-to-left characters.

You can put a dir attribute on any inline element. Use phrase if the text is not already inside an inline element. That is particularly useful for problems with parentheses or punctuation. You do not need to write a customization in order for these attribute values to have their effect. They automatically output the correct properties in HTML or XSL-FO for inline text elements. Then it is up to the browser or XSL-FO processor to handle it.
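
As a minimal sketch of this markup, the following paragraph wraps a Latin-alphabet span that contains parentheses in a phrase element. The Hebrew word is placeholder text, entered as numeric character references so the listing reads the same in any editor.

```xml
<para lang="he">
  <!-- Placeholder Hebrew word (shin, lamed, vav, final mem) -->
  &#x5E9;&#x5DC;&#x5D5;&#x5DD;
  <!-- Keep the product name and its parenthesized version together, left to right -->
  <phrase dir="ltr">DocBook XSL (1.0)</phrase>
  &#x5E9;&#x5DC;&#x5D5;&#x5DD;
</para>
```

Without the phrase wrapper, the formatter has to guess which direction the parentheses and digits belong to, and the closing parenthesis may end up on the wrong side of the span.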

Start and end

When working with a language such as Hebrew or Arabic that reads right to left, you need to pay attention to the XSL-FO start and end terminology. These designate the two sides of a page, but which side each refers to depends on the writing mode (see the section “Writing mode” for details). The term start refers to the side of a page that a sentence starts from. With the default writing-mode="lr-tb", the term start refers to the left side. If you set writing-mode="rl-tb" (right to left), then start means the right side. In each case, end means the opposite side, where a sentence ends.

For example, when you set the body.start.indent stylesheet parameter to indent paragraphs relative to titles, it inserts a start-indent property in the XSL-FO output. That creates an indent on the left by default. But when you use right-to-left writing mode, the indents will be on the right, which is appropriate for those languages.
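
The following fragment sketches that combination. It is meant to sit inside an FO customization layer, and the 4pc value is illustrative.

```xml
<!-- Right-to-left layout: start now means the right side of the page -->
<xsl:attribute-set name="root.properties">
  <xsl:attribute name="writing-mode">rl-tb</xsl:attribute>
</xsl:attribute-set>

<!-- Becomes a start-indent property in the FO output, so body text
     is indented from the right edge under this writing mode -->
<xsl:param name="body.start.indent">4pc</xsl:param>
```

No separate right-indent setting is needed, because the same parameter follows whichever side the writing mode designates as the start.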

