Processing DocBook 5

Processing DocBook 5 files typically has two phases:

  • Validation

  • XSLT processing

Each of these is described in the following sections.

DocBook 5 validation

With the proliferation of XML schema languages, validation of a document has gotten more complicated. Validation used to be simple when there was only a DTD to validate a document against. Each document identified its DTD in its DOCTYPE declaration, and there were many tools available to validate it.

You can still use those same tools if you choose to use the DTD version of the DocBook 5 schema instead of RelaxNG. However, the DocBook 5 DTD is different from the RelaxNG DocBook 5 standard in several ways, and you should not expect all documents that validate against the DTD version to also validate against the RelaxNG schema, and vice versa.

All of the DocBook 5 schemas can be downloaded from DocBook.org. The different schema types are in different subdirectories of the distribution. The RelaxNG schema comes in two versions: one with support for XIncludes and one without. If you plan to use XIncludes (as described in Chapter 23, Modular DocBook files), then use the filename starting with docbookxi instead of docbook. Also, each version comes in the two RelaxNG syntax types: the full XML syntax version with a .rng filename extension, and the compact syntax version with a .rnc extension.

If you want to validate against the DocBook 5 RelaxNG schema, then you have to find the right validation tool. The DocBook 5 RelaxNG schema includes embedded Schematron rules to express certain constraints on some content models. For example, a Schematron rule is added to prevent a sidebar element from containing another sidebar. For complete validation, a validator needs to check both the RelaxNG content models and the Schematron rules.
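To make the sidebar constraint concrete, here is a minimal Python sketch of the same check. It is not how a Schematron processor works, and it is not taken from the DocBook schema. It only walks a parsed document and reports any sidebar that sits inside another sidebar. The function name nested_sidebars and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

DB_NS = "http://docbook.org/ns/docbook"
SIDEBAR = "{" + DB_NS + "}sidebar"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def nested_sidebars(root):
    """Return every sidebar element that has a sidebar ancestor."""
    found = []

    def walk(elem, inside_sidebar):
        is_sidebar = elem.tag == SIDEBAR
        if is_sidebar and inside_sidebar:
            found.append(elem)
        for child in elem:
            walk(child, inside_sidebar or is_sidebar)

    walk(root, False)
    return found


# Hypothetical sample document with one nested sidebar.
SAMPLE = """<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Sample</title>
  <sidebar xml:id="outer">
    <para>Outer sidebar.</para>
    <sidebar xml:id="inner"><para>Nested sidebar.</para></sidebar>
  </sidebar>
</article>"""

violations = nested_sidebars(ET.fromstring(SAMPLE))
print([v.get(XML_ID) for v in violations])  # ['inner']
```

A RelaxNG content model alone would accept the nested sidebar, which is why a validator that skips the Schematron rules can report such a document as valid.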

Many people use xmllint from the libxml2 toolkit to validate their DocBook 4 documents. Although xmllint has a --relaxng option to validate against a RelaxNG schema (XML syntax only), it does not process Schematron rules. So you will not be able to fully validate a DocBook 5 document with xmllint currently.

The following sections describe three tools that handle complete DocBook 5 validation, including Schematron rules.

Sun MSV

Sun Microsystems' Multi-Schema Validator (MSV) is a free tool that can be used to validate a document against more than one schema at a time. This is useful if your document has components from more than one namespace, each of which has its own schema.

To use MSV with the DocBook 5 RelaxNG schema, you need the relames version, which supports the embedded Schematron rules used in DocBook 5.0.

Download the latest version of relames from Sun's MSV website. Unpack the zip file and note the location of the relames.jar file. To validate a file, use a command like the following:

java -Xss512K -jar relames.jar rng/docbook.rng mydocument.xml

The -Xss512K option raises the Java stack size to avoid stack overflow errors.

More details on using Sun's Multi-Schema Validator are available in DocBook 5.0: The Transition Guide.

Topologi validator

Topologi Pty. Ltd. makes a number of commercial XML tools, but it also makes available a free validation tool for RelaxNG schemas with embedded Schematron rules. Their Schematron.zip implementation is available for download at http://www.topologi.com/resources/Schematron.zip. The download includes a Windows batch file for running the validator:

The batch file syntax:
EmbRNG_java.bat document RNG-schema

Example batch command:
EmbRNG_java.bat  mybook5.xml  docbook50/rng/docbook.rng

You have to execute the batch file from within its own directory because it depends on relative paths to its jar files and XSL stylesheets. The document and schema can be elsewhere. The validator will check the RelaxNG content rules, as well as any embedded Schematron rules. You can only use this validator with the full XML syntax version of RelaxNG (docbook.rng), not the compact syntax version (docbook.rnc).

Oxygen XML editor

The commercial product Oxygen XML editor can validate a DocBook 5 document. It handles both the RelaxNG content models and the embedded Schematron rules. It can also validate while you are editing the document.

To associate the DocBook 5 RelaxNG schema with DocBook 5 documents, you can associate the DocBook 5 namespace with the schema pathname. Use this menu sequence:

Options+Preferences+Editor+Default Schema Associations

More details on using Oxygen are available in DocBook 5.0: The Transition Guide.

DocBook 5 XSLT processing

To process your DocBook 5 documents, you can use the same XSLT and XSL-FO processors that you use for DocBook 4 documents. You do have a choice of two sets of stylesheets:

  • The original docbook-xsl stylesheet distribution can process DocBook 5 documents, but it first strips the DocBook namespace from the document elements so they match the patterns in the stylesheet.

  • The docbook-xsl-ns stylesheet distribution operates directly in the DocBook 5 namespace for pattern matching on elements.

The two approaches are discussed in the following sections.

Using DocBook 4 stylesheets with DocBook 5

The original DocBook XSL stylesheets written for DocBook 4 documents would not normally work with DocBook 5 documents. That's because a pattern match in an XSL template must match an element's namespace as well as its local name. When you process a DocBook 5 document with the original stylesheets, none of the pattern matching templates include the DocBook 5 namespace, so none of the templates will match any elements in the document.
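The same matching rule can be seen outside of XSLT. The following Python sketch uses the standard library's ElementTree, which is only an analogy for XSLT pattern matching: a lookup by bare element name finds nothing in a DocBook 5 document, while a namespace-qualified lookup succeeds. The sample documents and the count_matches helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

DB_NS = "http://docbook.org/ns/docbook"

# Hypothetical one-paragraph documents, with and without the namespace.
DOC5 = '<article xmlns="' + DB_NS + '" version="5.0"><para>Hello</para></article>'
DOC4 = "<article><para>Hello</para></article>"


def count_matches(xml_text, path):
    """Count child elements of the root that match the given name."""
    return len(ET.fromstring(xml_text).findall(path))


plain = count_matches(DOC5, "para")                      # bare name, DocBook 5
qualified = count_matches(DOC5, "{" + DB_NS + "}para")   # qualified name, DocBook 5
docbook4 = count_matches(DOC4, "para")                   # bare name, DocBook 4

print(plain, qualified, docbook4)  # 0 1 1
```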

The stylesheets work around this problem with a simple mechanism. If the stylesheet detects a document whose root element is in the DocBook 5 namespace, it first copies the entire document into a variable while stripping out the namespace from all the elements. The stylesheet then converts the variable into a node set, and applies templates to the node set normally. Because the elements in the node set are no longer in the DocBook 5 namespace, its elements will match the patterns in the original stylesheets. All this takes place automatically before the actual processing starts.
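The stripping step can be sketched in a few lines of Python. This is not the stylesheet's own code, which is written in XSLT. It is only an illustration of the idea: copy the tree, remove the DocBook namespace from every element name, and process the copy. The function names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

DB_NS = "http://docbook.org/ns/docbook"
PREFIX = "{" + DB_NS + "}"


def strip_docbook_namespace(root):
    """Return a copy of the tree whose elements are no longer in the DocBook namespace."""
    stripped = copy.deepcopy(root)
    for elem in stripped.iter():
        # Skip non-element nodes, whose tag is not a string.
        if isinstance(elem.tag, str) and elem.tag.startswith(PREFIX):
            elem.tag = elem.tag[len(PREFIX):]
    return stripped


def prepare(root):
    """Strip the namespace only when the root element is in the DocBook 5 namespace."""
    if root.tag.startswith(PREFIX):
        return strip_docbook_namespace(root)
    return root


doc5 = ET.fromstring(
    '<article xmlns="http://docbook.org/ns/docbook" version="5.0">'
    "<para>Hello</para></article>"
)
ready = prepare(doc5)
print(ready.tag, len(ready.findall("para")))  # article 1
```

Note that the sketch works on a copy and leaves the original tree untouched, which mirrors the copy-into-a-variable step described above.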

The result is that you can process DocBook 5 documents with the same commands as for DocBook 4 documents. You will see these messages indicating what is going on:

Stripping namespace from DocBook 5 document.
Processing stripped document.

The advantages of processing DocBook 5 documents with the original DocBook 4 stylesheets are:

  • You can use your existing customizations.

  • You can process both DocBook 4 and DocBook 5 documents with the same tool chain.

The disadvantages of these stylesheets for DocBook 5 are:

  • The act of copying the content into a node set loses the base URI of the document. This can create problems with relative path references for images and for finding the olink database (if you use olinks).

  • You do not learn anything about namespaces.

Using DocBook 5 stylesheets

The stylesheets included in the docbook-xsl-ns distribution are copies of the original stylesheets, but with the DocBook namespace prefix added to element names in pattern matches and expressions. For example:

<xsl:template match="d:para">
...

You can process DocBook 5 documents using the same commands as for DocBook 4 documents, just substituting the path to the equivalent docbook-xsl-ns stylesheet instead. The behavior of the stylesheets should be identical to the originals.

If you happen to process a DocBook document whose root element lacks the namespace declaration, the stylesheet does not fail. Rather, it detects that the document does not have the namespace, and preprocesses it to add the namespace to all elements in the document. It uses the same node set trick that the original stylesheets use to strip the namespace from DocBook 5 documents. Generally it will be better to use the original DocBook 4 stylesheets for DocBook 4 documents.
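The reverse preprocessing step can be illustrated the same way. The following Python sketch is not the stylesheet's XSLT code. It only shows the idea of copying a document and putting every element that has no namespace into the DocBook namespace. The function name add_docbook_namespace is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

DB_NS = "http://docbook.org/ns/docbook"
PREFIX = "{" + DB_NS + "}"


def add_docbook_namespace(root):
    """Return a copy of the tree with un-namespaced elements moved into the DocBook namespace."""
    result = copy.deepcopy(root)
    for elem in result.iter():
        # Only touch real elements that are not in any namespace yet.
        if isinstance(elem.tag, str) and not elem.tag.startswith("{"):
            elem.tag = PREFIX + elem.tag
    return result


doc4 = ET.fromstring("<article><para>Hello</para></article>")
doc5 = add_docbook_namespace(doc4)
print(doc5.tag == PREFIX + "article", len(doc5.findall(PREFIX + "para")))  # True 1
```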

The advantages of processing DocBook 5 documents with the DocBook 5 stylesheets are:

  • You can write customization layers using the DocBook namespace.

  • There is no temporary node set that loses the document URI, which can mess up resolving relative paths in some cases.

The disadvantages of these stylesheets for DocBook 5 are:

  • You cannot use an existing customization layer until you add the DocBook namespace prefix to all element names used in patterns and expressions in the stylesheet.

  • You have to learn something about namespaces.

See the section “Customizing DocBook 5 XSL” for more information about creating customizations for the DocBook 5 stylesheets.

Using XIncludes with DocBook 5

XInclude is an XML inclusion mechanism that is described in Chapter 23, Modular DocBook files. An XInclude uses an href attribute to reference another file that is to be included in the main file. The XInclude may also have a fragment reference to a specific id attribute (DocBook 4) or xml:id (DocBook 5) in the included file in order to include just that specific element from the file.
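As a rough illustration of the href form, the following Python sketch resolves an XInclude with the standard library's ElementInclude module. It covers only whole-file inclusion, not fragment references, and it is not the processor you would use in a DocBook tool chain. The file name chapter1.xml, the in-memory FILES table, and the memory_loader function are all made up so that the example runs without touching the file system.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical "file system": included documents keyed by href.
FILES = {
    "chapter1.xml": (
        '<chapter xmlns="http://docbook.org/ns/docbook" xml:id="intro">'
        "<title>Introduction</title></chapter>"
    ),
}


def memory_loader(href, parse, encoding=None):
    """Resolve an href against the FILES table instead of reading from disk."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


BOOK = """<book xmlns="http://docbook.org/ns/docbook"
      xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
  <title>Sample book</title>
  <xi:include href="chapter1.xml"/>
</book>"""

book = ET.fromstring(BOOK)
ElementInclude.include(book, loader=memory_loader)  # replaces xi:include in place

tags = [child.tag.split("}")[1] for child in book]
print(tags)  # ['title', 'chapter']
```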

If you use this mechanism with DocBook 4 files, a fragment reference will fail if the processor cannot load the included file's DTD to confirm that an attribute named id is of attribute type ID, and therefore subject to being referenced. ID is a specific attribute type defined in the XML specification, to be declared in the DTD or schema. If the DTD is not readable for some reason, the processor cannot assume that an attribute named id is of type ID, and the XInclude will fail. Often the error message is not very helpful for determining the cause of the problem.

With DocBook 5, you must use xml:id instead of id to identify each element. This has the advantage that xml:id is predefined by a separate W3C Recommendation to have attribute type ID. Therefore you do not have to worry about fragment references failing due to lack of schema.
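Because xml:id always lives in the fixed XML namespace, a tool can find the target element without consulting any DTD or schema. The following Python sketch shows such a lookup. It illustrates the principle only and is not how an XInclude processor is implemented. The function name find_by_xml_id and the sample chapter are hypothetical.

```python
import xml.etree.ElementTree as ET

DB_NS = "http://docbook.org/ns/docbook"
# The xml: prefix is always bound to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_by_xml_id(root, target):
    """Return the first element whose xml:id equals target, or None."""
    for elem in root.iter():
        if elem.get(XML_ID) == target:
            return elem
    return None


CHAPTER = """<chapter xmlns="http://docbook.org/ns/docbook" version="5.0" xml:id="install">
  <title>Installation</title>
  <section xml:id="install.linux"><title>Linux</title></section>
  <section xml:id="install.windows"><title>Windows</title></section>
</chapter>"""

root = ET.fromstring(CHAPTER)
section = find_by_xml_id(root, "install.windows")
print(section.find("{" + DB_NS + "}title").text)  # Windows
```

No schema was loaded at any point in this sketch. With a DocBook 4 id attribute, by contrast, nothing in the document itself says that the attribute is of type ID.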

What about XSLT 2 and Saxon 8?

You may have heard about XSLT version 2 and the implementation of it in Saxon 8, and you might be wondering how it applies to DocBook 5. The short answer is that it does not yet apply very much, any more than it applies to DocBook 4. But it will in the future, big time.

XSLT 2.0 became a full W3C Recommendation on 23 January 2007. It is the next generation of XSLT processing, with a simpler processing model and a greatly enhanced XPath 2.0 selection language associated with it. Version 8 of the Saxon XSLT processor has long been the only implementation of XSLT 2, since it was essentially used as a test bed by Michael Kay, the editor of the XSLT 2 specification. Now that XSLT 2 is an official Recommendation, many more XSLT 2 processors are expected to become available.

More importantly for DocBook users, Norman Walsh was an active member of the working group that created XSLT 2. Norman Walsh was the original creator of the DocBook XSL stylesheets, and he is working on an XSLT 2 set of DocBook stylesheets. They are not finished, but they are available as alpha releases for those who want to experiment with XSLT 2. To customize the XSLT 2 stylesheets, you will need to learn XSLT 2 and XPath 2.

But you do not need XSLT 2 to process DocBook 5 documents. The current set of docbook-xsl-ns stylesheet files are written in XSLT 1 and work with existing XSLT 1 processors such as Saxon 6, Xalan, and xsltproc.

The Saxon 8 processor can be used with some XSLT 1 stylesheets, because the XSLT 2 standard defines a backwards compatibility mode. When an XSLT 2 processor such as Saxon 8 sees a version="1.0" attribute in the stylesheet's root element, it switches into that mode. But the backwards compatibility is not complete, so there is no guarantee that the existing DocBook XSL stylesheets will work with Saxon 8. Unless a stylesheet is written in XSLT 2, there is no advantage to using Saxon 8 over Saxon 6.