Cleaning up an FO index

The index page number problems described in the previous section cannot be solved by the DocBook XSL stylesheets alone because the page number for a given indexterm is not known during the XSLT step. Text is placed on pages later by the XSL-FO processor, which does not necessarily recognize that a piece of text is an index entry. Also, the XSL-FO 1.0 standard has no properties for consolidating page ranges.

Some FO processors such as XEP and Antenna House have extension functions that can be used to fix up index page numbers. The DocBook XSL stylesheets output these indexing extensions if the xep.extensions parameter or the axf.extensions parameter, respectively, is set to 1. The FOP processor does not yet have such extensions.
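
As a rough sketch of turning on these extensions from the command line, the wrapper below passes the chosen parameter to xsltproc in the same style as the other xsltproc commands in this section. The function name make_fo_with_extensions and the file names are hypothetical, used only for illustration; only the parameter names and the stylesheet path come from the text.

```shell
# Hypothetical helper: generate an FO file with one processor's
# indexing extensions switched on.
#   $1  parameter name: xep.extensions or axf.extensions
#   $2  DocBook source document
#   $3  FO file to write
make_fo_with_extensions() {
    xsltproc -o "$3" \
        --stringparam "$1" 1 \
        fo/docbook.xsl "$2"
}

# Example call (assumes xsltproc is installed and fo/docbook.xsl exists):
#   make_fo_with_extensions xep.extensions mybook.xml mybook.fo
```

The resulting FO file should then be processed with the matching FO processor, since each set of extensions is specific to one product.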

For FOP, one solution to this problem is to extract page number information from the PDF output file and then use it to rebuild the index. This method is described briefly on the reference page for the make.index.markup parameter. The following is a summary of the steps.

  1. You need a utility named pstotext to extract information from PDF files. It is available packaged in an RPM for Linux from http://rpmfind.net.

  2. Process your document, which should contain an empty <index/> element, with the fo/docbook.xsl stylesheet and the make.index.markup parameter set to 1. That generates the index but inserts it as XML markup in the FO file. For example:

    xsltproc  -o mybook.fo  \
        --stringparam  make.index.markup  1  \
        fo/docbook.xsl  mybook.xml

  3. Convert the FO file to PDF using your favorite XSL-FO processor.

  4. Run the pdf2index Perl script on your PDF file and save the output to a file:

    fo/pdf2index  mybook.pdf  >  myindex.xml

    The generated myindex.xml file contains an index marked up with DocBook index elements, with page information inserted as well.

  5. Replace the empty <index/> element in your document with the contents of this generated file. You can do it with a system entity or XInclude.

  6. Process your document again with fo/docbook.xsl and your favorite XSL-FO processor, this time omitting the make.index.markup parameter.

The result of this process is a PDF file for your document that contains an index with page numbers properly collapsed. Duplicate numbers should be removed, and sequences of consecutive pages should appear as page ranges.
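
To make the expected result concrete, the following sketch shows what collapsing means for the page list of a single index entry: duplicates are dropped and consecutive pages are merged into ranges. It is only an illustration and is not the logic that pdf2index itself uses. The function name collapse_pages is hypothetical, and the sketch assumes plain arabic page numbers separated by spaces or commas.

```shell
# Hypothetical helper: read page numbers on stdin and print them
# with duplicates removed and consecutive runs shown as ranges.
collapse_pages() {
    # One number per line, numeric sort, duplicates dropped.
    tr ' ,' '\n\n' | grep -E '^[0-9]+$' | sort -n -u | awk '
        function fmt(a, b) { return (a == b) ? a : (a "-" b) }
        NR == 1        { start = prev = $1; next }
        $1 == prev + 1 { prev = $1; next }          # extend the current run
        { out = out sep fmt(start, prev); sep = ", "; start = prev = $1 }
        END { if (NR) print out sep fmt(start, prev) }
    '
}

echo "3 3 4 5 9" | collapse_pages    # prints: 3-5, 9
```

An entry whose raw page list reads 3, 3, 4, 5, 9 would therefore be expected to appear in the finished index as 3-5, 9.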