Internationalized indexes

Indexes that are generated from indexterm elements are a challenge for the stylesheets. The index entries have to be collected from throughout the document, sorted into groups, and sorted within each group, before being formatted in the output. How index terms are sorted and grouped depends heavily on the language and alphabet used in the document. English is probably the simplest, with only 26 letters. Latin-alphabet languages that add accented characters, such as French, are a bit more complex, and ideographic languages such as Chinese are more complex still. The DocBook XSL stylesheets provide the tools and methods to handle all of these languages and alphabets.

There are two processes that govern how an index is generated: sorting and grouping. Sorting means arranging the characters of an alphabet in a certain order. Grouping means treating a set of characters as the same character as far as assigning them to index sections (the A, B, C sections in an English index). For example, words starting with a, A, á and à should all be in an index section labelled A. Likewise, all the letters in the group should sort as if they were the same letter. For example, áb, ac, and ád should sort in that order, and not sort the two words starting with á together.

The basic steps for generating an index are to put each entry in a group based on its first letter, sort the groups, and then sort the words within each group. The DocBook XSL stylesheets provide three different methods to perform these functions (starting with version 1.71.1 of the stylesheets). The index method is chosen using the stylesheet parameter index.method, which can have one of three values.

Table 19.1. Index methods

Each index.method value is listed below with its description, how it groups entries, and how it sorts them.

index.method: basic

  Description:

  • Suitable for English and many Latin-based languages.

  • It does not require any extensions, so it works with any XSLT processor.

  • Not configurable.

  • The default indexing method.

  Grouping: Index groups are defined only for the 26 English letters, and the order of the groups is fixed. A word is assigned to a group by internally mapping accented characters to their unaccented version.

  Sorting: The letter groups are output in fixed order. Sorting within each group is based on the language algorithm available to the XSLT processor.

index.method: kosek

  Description:

  • Suitable for languages that need more index groups or a different group order, and for which it is feasible to assign all letters to a group.

  • Uses EXSLT extension functions that do not work with the xsltproc processor.

  • Requires using a customization layer to import templates.

  • Employs a user-customizable index configuration that is part of the DocBook gentext file specific to each locale.

  • Named for its author, Jirka Kosek.

  Grouping: Any number of index groups can be created in the configuration for a given locale. Each group must identify which letters are included in that group.

  Sorting: The groups can be output in any order based on the configuration. Sorting within a group is based on the language algorithm available to the XSLT processor.

index.method: kimber

  Description:

  • Suitable for all languages and alphabet types, including ideographic languages.

  • Uses Java extension functions that only work with Saxon.

  • Requires using a customization layer to import templates.

  • Employs a user-customizable external configuration file that has a section for each locale.

  • The most flexible method, but more complicated to set up.

  • Named for its author, Eliot Kimber.

  Grouping: Any number of index groups can be created in the configuration for a given locale. Groups are specified by group membership lists, or by specifying break points in the sort order.

  Sorting: A custom Java sort algorithm can be specified. A separate Java collation configuration file can also be specified. In the current version, the order of groups is always based on the current sort order.

Each of these index methods is described in more detail in the sections that follow.

index.method = "basic"

When the stylesheet parameter index.method is set to basic, the default index processing is used. Since this is the default value, all you need to do is add an empty index element to your document at the location where you want the index to be generated. The stylesheet will do the rest.
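As a rough illustration, a minimal document that relies on the basic method might look like the following. The titles and index terms are invented for this sketch; only the indexterm markup and the empty index element matter here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sample document: titles and terms are made up for illustration. -->
<book lang="en">
  <title>Sample Book</title>
  <chapter>
    <title>Food</title>
    <para>Apples<indexterm><primary>apple</primary></indexterm>
    and bananas<indexterm><primary>banana</primary></indexterm>
    can be ordered
    à la carte<indexterm><primary>à la carte</primary></indexterm>.</para>
  </chapter>
  <!-- Empty index element: the stylesheet generates the index at this location. -->
  <index/>
</book>
```

With the grouping described earlier, the entries apple and à la carte would be expected to appear together in the A section, and banana in the B section.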

If you are wondering how the basic method operates, here is a summary of the steps:

  1. The stylesheet defines the mapping of characters to letter groups using the XSLT translate() function, which can substitute one character for another from the lists in its second and third arguments. The lists come from two text entities named &lowercase; and &uppercase; that are defined in the stylesheet (in the file common/entities.ent). The names are misleading, because the lowercase list includes lowercase and uppercase letters in accented and unaccented form. The uppercase list is repeated copies of the corresponding uppercase, unaccented letter. For example:

    lowercase:
    AaÀàÁáÂ...Bb...
    
    uppercase:
    AAAAAAA...BB...
  2. An xsl:key is created with name letter, which contains all the indexterm elements. The access key is the first letter of its primary child element, mapped to the uppercase, unaccented version of the letter using the translate function. Then by specifying an access key of A, the stylesheet can immediately find all indexterm elements whose primary element starts with any of these letters:

    A a À à Á á Â ...
  3. The template named generate-basic-index is called, which first gathers the first instance of each indexterm with a unique access key value. Thus it gathers the first A term if there is one, the first B, etc. Any letters for which there are no entries are omitted, so no empty index letter sections are created.

  4. The selected group of entries is sorted using the sort algorithm available to the XSLT processor. It uses the document lang attribute if available, otherwise it defaults to English. This puts the groups into the order in which they will be presented in the index.

  5. Each indexterm in the selected group is processed in mode="index-div-basic", which starts a new index letter group.

  6. In each invocation of mode="index-div-basic", all indexterm elements matching that particular access key are gathered. Then they are sorted and formatted within that letter group. Although the selection key is the uppercase, unaccented letter, each entry is output using its original characters that include lowercase and accented letters.
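The steps above can be sketched as a small standalone XSLT 1.0 stylesheet. This is a simplified illustration of the technique, not the actual DocBook code: the two literal strings are abbreviated stand-ins for the &lowercase; and &uppercase; lists, the output is plain text, and the template structure is an assumption made for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Simplified sketch of the grouping technique; not the DocBook stylesheet itself. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Step 2: key every indexterm by the mapped first letter of its primary.
       The literal strings are abbreviated stand-ins for the real entity lists. -->
  <xsl:key name="letter" match="indexterm"
           use="translate(substring(normalize-space(primary), 1, 1),
                          'AaÀàÁáÂâBbCc',
                          'AAAAAAAABBCC')"/>

  <xsl:template match="/">
    <!-- Steps 3 and 4: keep only the first indexterm for each key value,
         then sort those representatives to fix the order of the groups. -->
    <xsl:for-each select="//indexterm[generate-id() = generate-id(key('letter',
                            translate(substring(normalize-space(primary), 1, 1),
                                      'AaÀàÁáÂâBbCc',
                                      'AAAAAAAABBCC'))[1])]">
      <xsl:sort select="translate(substring(normalize-space(primary), 1, 1),
                                  'AaÀàÁáÂâBbCc',
                                  'AAAAAAAABBCC')"/>
      <xsl:variable name="group"
                    select="translate(substring(normalize-space(primary), 1, 1),
                                      'AaÀàÁáÂâBbCc',
                                      'AAAAAAAABBCC')"/>

      <!-- Step 5: start a new letter group with its heading. -->
      <xsl:value-of select="$group"/>
      <xsl:text>&#10;</xsl:text>

      <!-- Step 6: gather all entries with this key, sort them, and output
           each one using its original characters. -->
      <xsl:for-each select="key('letter', $group)">
        <xsl:sort select="normalize-space(primary)"/>
        <xsl:text>  </xsl:text>
        <xsl:value-of select="normalize-space(primary)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

In this sketch, a first letter that is missing from the abbreviated lists maps to itself and forms its own group, and repeated terms are not merged. The real stylesheets handle those cases, along with secondary and tertiary entries, in ways not shown here.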

index.method = "kosek"

When the stylesheet parameter index.method is set to kosek, a different indexing process is used. This method adds the ability to define indexing groups and their members, and the order of the groups within the index.

However, setting the parameter is not sufficient. You must use a customization layer in order to import a supplemental stylesheet module that contains some additional templates that are needed by this method. For example:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                version="1.0">
<!-- xsl:import elements must precede all other top-level elements -->
<xsl:import href="docbook-xsl/fo/docbook.xsl"/>
<xsl:import href="docbook-xsl/fo/autoidx-kosek.xsl"/>
<xsl:param name="index.method">kosek</xsl:param>
...

For an HTML customization layer, you would import the corresponding autoidx-kosek.xsl file from the html or xhtml stylesheet directory.
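A complete HTML customization layer along these lines might look like the following sketch. The relative paths assume the same docbook-xsl location used in the FO example; adjust them to wherever the stylesheets are installed on your system.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <!-- Imports come first; the paths are assumptions based on the FO example. -->
  <xsl:import href="docbook-xsl/html/docbook.xsl"/>
  <xsl:import href="docbook-xsl/html/autoidx-kosek.xsl"/>

  <!-- Select the kosek index method. -->
  <xsl:param name="index.method">kosek</xsl:param>
</xsl:stylesheet>
```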

The definitions of the groups used by this method are in the gentext locale files, such as common/fr.xml for French, whose group definitions begin like this:

<l:letters>
      <l:l i="-1"/>
      <l:l i="0">Symboles</l:l>
      <l:l i="1">A</l:l>
      <l:l i="1">a</l:l>
      <l:l i="1">&#224;</l:l>
      <l:l i="1">&#192;</l:l>
      <l:l i="1">&#226;</l:l>
      <l:l i="1">&#194;</l:l>
      <l:l i="1">&#198;</l:l>
      <l:l i="1">&#230;</l:l>
      <l:l i="2">B</l:l>
      <l:l i="2">b</l:l>
      ...

Note these features of the configuration:

  • All members are contained in the l:letters element, with each member specified by a l:l element.

  • The groups are identified by the different values of the i attribute in each letter element l:l. In this example, group 1 has all the “A” letters, group 2 has all the B's, etc. Included in the “A” group are all the accented versions of upper- and lowercase A, entered as Unicode character values (e.g., &#224; which is à). In this way, you can expand each group to include new characters if necessary, and create new groups with a new number index.

  • The first member of each group is used as the displayed section heading for that group in the index. So group 1 would display A as the section heading.

  • Any numerical i values can be used, not necessarily contiguous values (e.g., 10, 20, 30, etc.).

  • The order of groups in the index is based on the ascending numerical order of the i values that identify the groups.

  • All index entries that do not fall into one of the letter groups are assigned to the group with i="0" for symbols.

All of the gentext files in the common directory of the stylesheet distribution have a set of groups defined. However, many have not yet been prepared for their specific language, and simply contain a copy of the groups from the English file (which has many accented characters in its groups anyway). You can identify such groups by the lang="en" attribute on the l:letters element in the gentext file. If the groups for your language have not been properly prepared, you can create your own customization of the gentext elements. That process is described in the section “Customizing generated text”. If you are confident that your customization is correct, you could submit it back to the DocBook development team for inclusion in future releases.
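The following is a hedged sketch of what such a gentext customization might look like in a customization layer. It assumes that the kosek templates consult the local.l10n.xml parameter for l:letters in the same way the stylesheets do for other gentext elements; verify that against your stylesheet version. The letter list is deliberately abbreviated, and a working override would most likely need the complete set of groups for the language.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0"
                exclude-result-prefixes="l"
                version="1.0">
  <xsl:import href="docbook-xsl/fo/docbook.xsl"/>
  <xsl:import href="docbook-xsl/fo/autoidx-kosek.xsl"/>

  <xsl:param name="index.method">kosek</xsl:param>

  <!-- Tell the stylesheet to look in this same file for local gentext data. -->
  <xsl:param name="local.l10n.xml" select="document('')"/>

  <l:i18n>
    <l:l10n language="fr">
      <l:letters>
        <l:l i="-1"/>
        <l:l i="0">Symboles</l:l>
        <l:l i="1">A</l:l>
        <l:l i="1">a</l:l>
        <l:l i="1">&#224;</l:l>
        <l:l i="1">&#192;</l:l>
        <!-- Remaining members of group 1 and all other groups go here. -->
      </l:letters>
    </l:l10n>
  </l:i18n>
</xsl:stylesheet>
```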

The sorting of entries within each group is handled in the stylesheet by an xsl:sort element, with a lang attribute whose value is taken from the document being processed. So you must have an appropriate lang attribute on the root element of your document.
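As a minimal sketch of that idea, the standalone stylesheet below sorts primary terms using a lang value read from the document's root element, falling back to English when the attribute is absent. It is an illustration only and not the DocBook template code; the variable name and the plain-text output are choices made for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; not taken from the DocBook stylesheets. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Use the root element's lang attribute if present, otherwise English. -->
  <xsl:variable name="lang">
    <xsl:choose>
      <xsl:when test="/*/@lang">
        <xsl:value-of select="/*/@lang"/>
      </xsl:when>
      <xsl:otherwise>en</xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <xsl:template match="/">
    <xsl:for-each select="//indexterm/primary">
      <!-- The lang attribute of xsl:sort is an attribute value template. -->
      <xsl:sort select="normalize-space(.)" lang="{$lang}"/>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

How a given processor treats a lang value it has no collation rules for varies, so the resulting order should be checked with the processor you actually use.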

XSLT processors generally hand off the actual sorting to the collation support of the platform they run on, such as the operating system. So the results will depend on how well that platform can sort the language specified. If it does not have the proper collation rules for your language, then the results will likely be unsatisfactory.

For the customization to work, the XSLT processor must be able to use EXSLT extension functions, and it must be able to use them in xsl:key elements. Saxon is known to work with the customization. But xsltproc does not support using the EXSLT extensions in xsl:key, and so will not work.

index.method = "kimber"

When the stylesheet parameter index.method is set to kimber, Java extension functions are used in building the index. The Java classes used in this method were written by Eliot Kimber of Innodata Isogen, Inc. and donated to the open source community. A white paper describing the package is available at:

This index method is the most flexible and powerful, but also the most difficult to set up. Currently the Saxon 6 and 8 processors work with the extensions. Here is what you need to do.

  1. Download and unpack to some convenient location the Innodata Isogen Internationalization Support Library. You may need to register (for free) before getting access.

  2. Set the stylesheet parameter index.method to kimber.

  3. Create a customization layer in order to import a supplemental stylesheet module that contains some additional templates that are needed by this method. For example:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:fo="http://www.w3.org/1999/XSL/Format"
                    version="1.0">
    <!-- xsl:import elements must precede all other top-level elements -->
    <xsl:import href="docbook-xsl/fo/docbook.xsl"/>
    <xsl:import href="docbook-xsl/fo/autoidx-kimber.xsl"/>
    <xsl:param name="index.method">kimber</xsl:param>
    ...

    For an HTML customization layer, you would import the corresponding autoidx-kimber.xsl file from the html or xhtml stylesheet directory.

  4. During processing with Saxon, you must add the jar file to your CLASSPATH and set a Java property to point to the configuration files.

    java  \
      -cp "/xml/java/saxon.jar:/xml/java/i18n_support/i18n_support.jar" \
      -Dcom.innodata.i18n.home="/xml/java/i18n_support" \
      com.icl.saxon.StyleSheet \
      -o myfile.fo \
      myfile.xml \
      docbook-xsl/fo/docbook.xsl

If all the pieces are in place, you should get an index that is generated using the Java extensions. If you have not set a lang attribute on your document's root element, or if its value does not match any of those in the configuration file, then you will see a message like the following:

 - Failed to find index configuration for language 'xy', trying English.

This tells you that the code is working, but the configuration is not quite right. In the location where you unpacked the library, you will find the indexing configuration file config/botb_index_rules/botb_index_rules.xml (here botb means “back-of-the-book”). That configuration file has an index_config element for each locale, containing all the configuration elements for that language. See the white paper mentioned above for details on the configuration elements.

In order for the configuration to work for your language, the lang attribute on your document root element must match the value of a national_language element in the configuration file. For example, here is the start of the configuration for Czech:

<index_config>
  <national_language>cs-CZ</national_language>
  <description> <p>Czech index configuration</p> </description>
  <collation_spec></collation_spec>
  <sort_method>
    <sort_by_members/>
  </sort_method>
  <group_definitions>
    <term_group>
      <group_key>A</group_key>
      <group_members>
        <char_or_seq>a</char_or_seq>
        <char_or_seq>A</char_or_seq>
        <char_or_seq>Á</char_or_seq>
        <char_or_seq>á</char_or_seq>
      </group_members>
    </term_group>
    <term_group>
      <group_key>B</group_key>
      <group_members>
        <char_or_seq>b</char_or_seq>
        <char_or_seq>B</char_or_seq>
        ...

The document's root element must have lang="cs-CZ" for this index configuration to be used. Otherwise, edit the national_language element in the configuration file to match your document's lang value.
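For instance, a document meant to pick up the Czech configuration might start like the following sketch. The titles and index terms are invented for illustration; the relevant detail is the lang value on the root element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sample document; only the lang value is significant here. -->
<book lang="cs-CZ">
  <title>Sample Book</title>
  <chapter>
    <title>Introduction</title>
    <para>Some text<indexterm><primary>abeceda</primary></indexterm>
    and more text<indexterm><primary>Ámos</primary></indexterm>.</para>
  </chapter>
  <index/>
</book>
```

Given the group definitions shown above, both terms would be expected to fall in the group whose key is A, since a and Á are listed among its members.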

For ideographic languages such as Chinese, and for other languages written with thousands of distinct characters such as Korean, the configuration file can use the sort_between_keys sort method and specify an optional collation rules file, as shown in the following example:

<index_config>
  <national_language>ko-KR</national_language>
  <description>
    <p>Index configuration for Korean</p>
  </description>
  <collation_spec>
    <java_collation_spec>
      <include_collation_spec>ko-sort-rules.txt</include_collation_spec>
    </java_collation_spec>
  </collation_spec>
  <sort_method>
    <sort_between_keys/>
  </sort_method>
  <group_definitions>
  <term_group>
    <group_key>&#x3131;</group_key>
    <group_members></group_members>
  </term_group>

When sort_between_keys is used, all of the index terms are sorted into a stream for processing. As each term is processed, if its first character matches a group_key value, that triggers the start of a new index group. All terms in the stream up to the next match belong to that group. This method is most suitable for writing systems with thousands of characters, for which group membership lists are impractical. You can also specify your own sort order for the characters by specifying a file containing a Java collation specification. See the white paper mentioned above for more information.