Chunking into multiple HTML files

You may want to split the output for a large document into several HTML files. That process is known in DocBook as chunking, and the individual output files are called chunks. The results are a coherent set of linked files, with a title page containing a table of contents as the starting point for browsing the set.

You get chunked output by processing your XML input file with the html/chunk.xsl stylesheet instead of the standard html/docbook.xsl stylesheet. For example:

xsltproc  /usr/share/docbook-xsl/html/chunk.xsl  myfile.xml

The default behavior in chunking includes:

The name of the main titlepage/table of contents file is index.html.

Each of the following elements starts a new chunk:

appendix
article
bibliography  in article or book
book
chapter
colophon
glossary      in article or book
index         in article or book
part
preface
refentry
reference
sect1        except first
section      if equivalent to sect1
set
setindex

Each chunk filename is generated with an algorithm. It can instead be named after the id attribute value of its starting element, if it has one (see the section “Generated filename”).
A message is displayed for each chunk filename that is generated. If you prefer not to see those messages, then set the chunk.quietly parameter to 1.

Chunk filenames

Each chunk has to have a filename. The filename (before adding .html) can come from four sources, selected in this order:

A dbhtml filename processing instruction embedded in the element.
If it is the root element of the document, then the chunk is named using the value of the parameter root.filename, which is index by default.
The chunk element's id attribute value (but only if the use.id.as.filename parameter is set).
A unique name generated by the stylesheet.

dbhtml filenames

You can embed processing instructions (PI) in your DocBook XML files that instruct the XSL stylesheets what filename to use for a chunk. The following is an example:

<chapter><?dbhtml filename="intro.html" ?>
<title>Introduction</title>
...

The dbhtml name indicates that this processing instruction is intended for DocBook HTML processing. This dbhtml filename processing instruction says that the HTML chunk file for this chapter should be named intro.html. The stylesheet does not add a filename extension when dbhtml filename is used. The processing instruction needs to be an immediate child of the element you are naming, not inside one of its children. For example, it will not work if you put it inside the title element of a chapter. If there is more than one such PI in an element then the first one is used.

id attribute filenames

If the element that starts a new chunk has an id attribute, then that value can be used as the start of the chunk filename. The stylesheet parameter use.id.as.filename controls that behavior. If that parameter is set to a non-zero value, then your chunk filenames will use the element's id attribute. By default, the parameter is set to zero, so you have to turn that behavior on if you want it. For example:

<chapter id="intro">
<title>Introduction</title>
...

This will work for all elements that have an id value and that start a chunk, except for the main index file. By default, that file is named using the value of the root.filename parameter, whose value is index by default. To use your document root element's id as that filename, set the root.filename parameter to blank.

When the id value is used, then the .html filename extension is automatically added. You can change the default extension by setting the html.ext parameter to some other extension, including the dot.
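As a rough sketch, id-based naming could be switched on from the command line as follows. The stylesheet path is the one used earlier in this section and may differ on your system, and the .htm extension is only an illustrative value for html.ext.

```shell
# Assumed stylesheet location and input name; adjust for your installation.
XSL=/usr/share/docbook-xsl/html/chunk.xsl
INPUT=myfile.xml

# use.id.as.filename=1 names chunks after their id values.
# html.ext is optional; ".htm" is just an illustrative alternative.
set -- --stringparam use.id.as.filename 1 \
       --stringparam html.ext .htm \
       "$XSL" "$INPUT"

# Run only if xsltproc and the input file are actually present.
if command -v xsltproc >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  xsltproc "$@"
fi
```

With these settings, a chapter whose id is intro would be written to intro.htm, while the root chunk still takes its name from root.filename.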

Filename prefix

There may be situations where you need to add a prefix to all the chunk filenames. For example, if you are putting the output for several chunked books into one directory, you could use a different prefix for each book to avoid filename duplication (and subsequent overwritten files).

If you need all of your chunk filenames to include some sort of prefix string, then you can use the base.dir stylesheet parameter. Normally the base.dir parameter is used to specify a directory to contain the chunked files, as described in the section “base.dir parameter”. When defining just an output directory with base.dir, you must end the parameter value with a literal / character. If you omit the trailing slash, then the chunk filename is appended to the value without a slash separator, effectively adding it as a prefix to each chunk filename. You can also combine a prefix and a directory name, as shown in the third example below.

base.dir parameter value	Description	Example chunk filename
`base.dir="htmlout/"`	Output directory only.	`htmlout/chap1.html`
`base.dir="refbook-"`	Filename prefix only.	`refbook-chap1.html`
`base.dir="htmlout/refbook-"`	Output directory and filename prefix.	`htmlout/refbook-chap1.html`
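The rows of the table follow from plain string concatenation. The sketch below imitates that behavior in shell; chunk_path is a hypothetical helper written for this illustration, not part of the stylesheets.

```shell
# Imitates how a chunk path is formed: base.dir followed directly by the
# filename, with no separator added. chunk_path is a hypothetical helper.
chunk_path() {
  printf '%s%s\n' "$1" "$2"
}

chunk_path "htmlout/"         "chap1.html"   # directory only
chunk_path "refbook-"         "chap1.html"   # prefix only
chunk_path "htmlout/refbook-" "chap1.html"   # directory and prefix
chunk_path "htmlout"          "chap1.html"   # slash forgotten: acts as a prefix
```

The last call shows the common mistake: without the trailing slash, the intended directory name becomes part of the filename.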

Generated filename

If not specified by a PI or id attribute, then the XSL stylesheet will generate a filename. The names are abbreviations of the element name and a count. For example, the first chapter element would be ch01.html, the second chapter would be ch02.html, and so on. The first sect1 in a chapter might be s01.html. But that filename would not be unique if each chapter had a sect1. To make each sect1 name unique, the stylesheet prepends the chapter part. So the first sect1 in the second chapter would be chunked into ch02s01.html. In general, the stylesheet keeps adding parent prefixes to make sure each name is unique. If a document is a set with multiple books, then the stylesheet would also add a book prefix to make a name like bk01ch02s01.html.

The names are not pretty, but they do have a recognizable logic. They are also somewhat stable, as opposed to random number names that might have been used instead. But the filenames may change if the document is edited, because when you insert a chapter, subsequent chapters are bumped up in number. If you are creating a website in which other files refer to these chunk filenames, then they are moving targets unless the document never changes. If you want to point to your generated files, it's best not to use generated filenames, and instead to use one of the other methods to name them. Using the id attribute is the easiest.
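To make the naming pattern concrete, here is a small shell imitation of it. This is an illustration of the pattern described above, not the stylesheet's own code, and gen_name is a hypothetical helper.

```shell
# Illustration only: reproduces the naming pattern, not the stylesheet logic.
# Usage: gen_name BOOK CHAPTER SECT1   (0 omits that component)
gen_name() {
  name=""
  [ "$1" -gt 0 ] && name=$(printf '%sbk%02d' "$name" "$1")
  [ "$2" -gt 0 ] && name=$(printf '%sch%02d' "$name" "$2")
  [ "$3" -gt 0 ] && name=$(printf '%ss%02d'  "$name" "$3")
  printf '%s.html\n' "$name"
}

gen_name 0 2 0   # second chapter
gen_name 0 2 1   # first sect1 in the second chapter
gen_name 1 2 1   # same section inside the first book of a set
gen_name 0 3 1   # same section after a chapter is inserted before it
```

The last call shows why links to generated names are fragile: inserting one chapter ahead of the second turns ch02s01.html into ch03s01.html.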

Chunked files output directory

The first thing you will notice when you chunk a document is that it can produce a lot of HTML files! Suddenly your directory is very crowded with new HTML files. When chunking, most people choose to place the chunked files into a separate directory.

One method that does not work is to use the processor's --output option. That option is used to redirect the standard output of the processor to a file. During chunking, the stylesheet creates the filenames and files, and also needs to handle the directory location.

base.dir parameter

You inform the stylesheet of the desired directory location using the base.dir parameter. For example, to output the chunked files to the /usr/apache/htdocs directory:

xsltproc --stringparam base.dir /usr/apache/htdocs/ chunk.xsl myfile.xml

Things to watch out for:

Be sure to include that trailing / because the stylesheet simply appends the filename to this string. If you forget the trailing slash, you'll end up with all your filenames beginning with that name. If you need such a filename prefix, then see the section “Filename prefix” for details.
The stylesheets can create files, but some processors will not create directories. Saxon and Xalan will create directories, but xsltproc will not. So create any directories before running xsltproc.
Be aware that the base.dir parameter only works with the chunk stylesheet, not the regular docbook.xsl stylesheet. It does work with the onechunk.xsl stylesheet, though.
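The first two cautions can be handled together in a small wrapper. The following is a minimal sketch that assumes a local output directory named htmlout (instead of /usr/apache/htdocs, so that it runs without special permissions) and the stylesheet location used earlier in this section.

```shell
# Assumed names; adjust for your installation.
XSL=/usr/share/docbook-xsl/html/chunk.xsl
INPUT=myfile.xml
OUTDIR=htmlout

# xsltproc will not create the directory, so create it first.
mkdir -p "$OUTDIR"

# Append the trailing slash explicitly so it cannot be forgotten.
if command -v xsltproc >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  xsltproc --stringparam base.dir "$OUTDIR/" "$XSL" "$INPUT"
fi
```

Keeping the directory name in a variable without the slash, and adding the slash only where base.dir is passed, avoids accidentally turning the directory name into a filename prefix.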

dbhtml dir processing instruction

You can also use a dbhtml dir processing instruction to modify where the chunked output goes. For example:

<book><?dbhtml dir="UserGuide" ?>
<title>User Guide</title>
...
<chapter id="intro">
...

This sets the output directory to be UserGuide for the root element chunk and all of its children and descendants (unless otherwise specified). Since this is a relative pathname, the output will be relative to the current directory. So in this example the root element chunk will be UserGuide/index.html, and the first chapter chunk will be in UserGuide/intro.html since it is a child of the book element. Note that the dbhtml dir value does not have a trailing slash because the stylesheet inserts one.

If the base.dir parameter is set, then that value is prepended to the dir value. For example, you could process the above file using:

xsltproc --stringparam base.dir /usr/apache/htdocs/ chunk.xsl myfile.xml

Then the root element chunk will be in /usr/apache/htdocs/UserGuide/index.html. Remember that base.dir does need a trailing slash.

If any of the descendants of the root element also have a dbhtml dir processing instruction, then that value is appended to the ancestor's value. That means it is relative to its ancestor element's directory. This allows you to build up a longer pathname to divide the output into several subdirectories of the main directory. For example:

<book><?dbhtml dir="UserGuide" ?>
<title>User Guide</title>
...
<chapter id="intro"><?dbhtml dir="FrontMatter" ?>
...
<chapter id="installing">
...
<appendix id="reference"><?dbhtml dir="BackMatter" ?>
...

Now the output chunks will be:

UserGuide/index.html
UserGuide/FrontMatter/intro.html
UserGuide/installing.html
UserGuide/BackMatter/reference.html

Note that the second chapter is not a child of the first chapter, so its directory reverts to that of the book-level PI. Again, if the base.dir parameter is set, then all of these become relative to that value. Remember that you need to create any directories you specify, because the stylesheets will not.
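For the example above, the directories could be prepared before processing along these lines. This is a sketch only: the htmlout base directory and the stylesheet path are assumptions, and use.id.as.filename is set because the example filenames come from id values.

```shell
# Hypothetical base output directory and assumed stylesheet location.
BASE=htmlout
XSL=/usr/share/docbook-xsl/html/chunk.xsl
INPUT=myfile.xml

# Create every directory named by a dbhtml dir PI, relative to base.dir.
mkdir -p "$BASE/UserGuide/FrontMatter" "$BASE/UserGuide/BackMatter"

if command -v xsltproc >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  xsltproc --stringparam base.dir "$BASE/" \
           --stringparam use.id.as.filename 1 \
           "$XSL" "$INPUT"
fi
```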

The dbhtml dir processing instruction can be used to specify a full pathname if you do not use a base.dir parameter, but that's not a good idea. That hard codes the path into your file, which means you have to edit the file to put the output elsewhere. Generally this PI is used to create directories relative to some base output directory that you specify on the command line with a parameter. That gives you the flexibility to put the output where you want, yet maintains the relative structure of the subdirectories specified by the PIs.

In all cases, cross references between your chunked files should still resolve, regardless of what the relative locations are.

Fast chunking

If you are chunking large documents, then there is a stylesheet variation you can use that will speed up the processing. The caveat is that the XSL processor you are using must support the EXSLT node-set() function. That includes Saxon, Xalan, and xsltproc. It does not include MSXSL, however.

To speed up chunking, use the chunkfast.xsl stylesheet instead of the regular chunk.xsl stylesheet. The chunkfast.xsl stylesheet is a customization of chunk.xsl and is included with the distribution in the html (or xhtml) directory. It handles chunks in a more efficient manner. In the regular chunk.xsl stylesheet, the calculation of the Next and Previous elements for each chunk is performed each time a chunk is output. That calculation requires searching the document using XPath, which can take some time for large documents. When chunkfast.xsl is used instead, those calculations are all done once ahead of time, so that output can proceed without delay.

You may notice that there is a chunk.fast parameter included in the stylesheets. Setting that parameter is not sufficient for getting the correct fast chunking behavior. You have to use the chunkfast.xsl stylesheet in order for the headers and footers to be correct. That stylesheet sets the parameter and customizes some templates.

Table of contents chunking

When chunking a book, the DocBook XSL stylesheets normally put the table of contents (TOC) in the same chunk as the book's title page. The stylesheets provide options for generating separate chunks for the table of contents, and for any lists of titles such as List of Tables.

If you set the stylesheet parameter chunk.tocs.and.lots to 1, then the stylesheet will generate a separate chunk that contains the table of contents and all the lists of titles. The title page chunk will then contain a link to the new chunk. If you also set the parameter chunk.separate.lots to 1, then each of the lists of titles will get a separate chunk as well. If you set only chunk.separate.lots to 1, then your table of contents will appear in the title page chunk, and only the lists of titles will get separate chunks. The chunk.separate.lots parameter was added in version 1.66.1 of the stylesheets.
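As an illustration, both parameters could be passed on the command line like this, assuming the stylesheet location used earlier in this section. Drop either --stringparam pair to get the other combinations described above.

```shell
# Assumed stylesheet location and input name; adjust for your installation.
XSL=/usr/share/docbook-xsl/html/chunk.xsl
INPUT=myfile.xml

# Separate chunk for the TOC, plus one chunk per list of titles.
set -- --stringparam chunk.tocs.and.lots 1 \
       --stringparam chunk.separate.lots 1 \
       "$XSL" "$INPUT"

if command -v xsltproc >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  xsltproc "$@"
fi
```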

Note

The chunk.toc parameter does not generate a separate table of contents chunk. Rather, it is used to manually designate chunking boundaries. See the section “Manually control chunking” for more information.

Controlling what gets chunked

There are three options in the DocBook XSL stylesheets for controlling what gets chunked:

Set the parameters chunk.section.depth and/or chunk.first.sections.
Chunk based on a manually edited table of contents file.
Modify the chunk template.

If you only want to control what section levels get put into separate HTML files, then you should set the chunk.section.depth parameter. By default it is set to 1. So if you want sect1 and sect2 elements to be chunked into individual files, set the parameter to 2.

The chunk stylesheet by default includes the first sect1 of a chapter (or article) with the content that precedes it in the chapter. If you want those also to be chunked to separate files, then set the chunk.first.sections parameter to 1.
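A command along the following lines would chunk down to sect2 and also give the first section of each chapter its own file. It is a sketch that assumes the stylesheet location used earlier in this section.

```shell
# Assumed stylesheet location and input name; adjust for your installation.
XSL=/usr/share/docbook-xsl/html/chunk.xsl
INPUT=myfile.xml

# Depth 2 chunks sect1 and sect2; chunk.first.sections=1 also splits off
# the first section instead of keeping it with the chapter introduction.
set -- --stringparam chunk.section.depth 2 \
       --stringparam chunk.first.sections 1 \
       "$XSL" "$INPUT"

if command -v xsltproc >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  xsltproc "$@"
fi
```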

Manually control chunking

If the standard chunking process does not meet your needs, and you are willing to manually intervene, then you can completely control how content gets chunked. This might be useful if some sections are very short and you would rather keep them together. But since it requires hand editing of a generated table of contents file, it is only useful if done infrequently or with documents that have stable structure.

Here are the steps for manually chunking HTML output:

Make sure all the elements you want to become chunks have an id attribute on them.
Process your document with the special maketoc.xsl stylesheet, which generates an XML table of contents file. Using xsltproc for example:
```
xsltproc  -o mytoc.xml \
  --stringparam chunk.section.depth 8 \
  --stringparam chunk.first.sections 1 \
  html/maketoc.xsl  myfile.xml
```
The two parameters ensure that all sections are included in the generated TOC file.
Edit the generated mytoc.xml file to remove any tocentry elements that you do not want chunked, or add entries that you do want chunked.
Process your document with the special chunktoc.xsl stylesheet instead of the regular chunk.xsl stylesheet, and pass it the generated TOC filename in the chunk.toc parameter. For example:
```
xsltproc \
   --output  output/  \
   --stringparam chunk.toc  mytoc.xml  \
   html/chunktoc.xsl  myfile.xml
```
This will chunk your document based on the entries in the generated TOC file. You can still use any of the chunking parameters to modify the chunking behavior.
If you also want the HTML TOC that is produced during chunking to match your XML TOC file, then set the parameter manual.toc to that same filename.

Note

When you use this process, you must have an id attribute on every element that you want to start a new chunk. This includes the document element, which generates the title page and table of contents. You can see which elements do not have an id by examining the generated TOC file and looking for empty id attributes in the tocentry elements. Any such entries will be merged with their parent elements during chunking.

Modify chunking templates

If you want to control what elements produce chunks, beyond just the section level choice, then you must modify the templates that do chunk processing. See the section “Chunking customization” for more information.

Output encoding for chunk HTML

You may need to change the output encoding for your chunked HTML files. The chunker.output.encoding parameter lets you change the default value of the HTML character encoding from the default value of ISO-8859-1. For example, if you want your HTML files to use UTF-8 encoding instead, you could process your document with the following:

xsltproc \
  --output  output/  \
  --stringparam chunker.output.encoding UTF-8  \
  html/chunk.xsl  myfile.xml

This will produce the following line in each chunked HTML file:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type">

It will also encode the HTML content itself using UTF-8 encoding. When a browser opens the file, the meta tag informs it that the file is encoded in UTF-8, so it decodes and displays the text accordingly. This feature is only available with Saxon and XSL processors that support EXSLT extensions (such as xsltproc). It does not work in Xalan, however.

Note

By default, chunked HTML output from Saxon will not contain any non-ASCII characters, regardless of the encoding you specify. Any non-ASCII characters will be represented as named entities or numerical character references. This behavior is controlled by the saxon.character.representation stylesheet parameter. See the section “Saxon output character representation” for more information.

The default output encoding for XHTML is UTF-8, as described in the section “XHTML”.

Specifying the output DOCTYPE

You may want to specify a particular DOCTYPE at the top of your chunked HTML files. This is most useful for XHTML output where you may want to validate the chunked files against the DTD.

There are two stylesheet parameters for the chunking stylesheet that affect the DOCTYPE:

chunker.output.doctype-public: Specifies the PUBLIC identifier of the DTD in the DOCTYPE.
chunker.output.doctype-system: Specifies the SYSTEM identifier of the DTD in the DOCTYPE.

See the section “Generating XHTML” for an example of using these parameters. Note that they do not work with the Xalan processor because it uses a different way of writing chunk files.
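For orientation, passing both parameters on the command line might look like the sketch below. It assumes the XHTML chunking stylesheet lives in an xhtml directory next to the html one, and it uses the standard XHTML 1.0 Transitional identifiers purely as example values.

```shell
# Assumed stylesheet location and input name; adjust for your installation.
XSL=/usr/share/docbook-xsl/xhtml/chunk.xsl
INPUT=myfile.xml

# Example identifiers: XHTML 1.0 Transitional.
PUBLIC_ID='-//W3C//DTD XHTML 1.0 Transitional//EN'
SYSTEM_ID='http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'

set -- --stringparam chunker.output.doctype-public "$PUBLIC_ID" \
       --stringparam chunker.output.doctype-system "$SYSTEM_ID" \
       "$XSL" "$INPUT"

if command -v xsltproc >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  xsltproc "$@"
fi
```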

Unfortunately, there is no way to add an internal subset to the output DTD using XSLT. If you do not know what an internal DTD subset is, then you probably do not need it. See a good XML reference for more information.

Indenting HTML elements

If you use a text editor to open an HTML file produced by DocBook XSL, you will notice that by default it produces long text lines that contain many elements. If you would prefer your HTML elements to start on a new line and have nested indents to show the HTML element structure, you can do that by setting the chunker.output.indent parameter to yes. Note that this feature is only available with XSL processors that support EXSLT extensions, but that includes most of the major ones. Xalan does not support this indenting option.

There are limits to which HTML elements can start an indented line. In general, any element that permits #PCDATA (plain text) as part of its content model will not allow the extra line breaks inside it. That is because white space must be respected inside such elements, and that respect includes not adding extra white space.

To add indentation with the non-chunking docbook.xsl stylesheet, you need to use a customization layer with an xsl:output element similar to the example in the section “Output encoding”. Use the indent="yes" attribute value to turn on indentation. The other approach for single-file output is to use the onechunk.xsl stylesheet and its extra parameters, as described in the section “Single file options with onechunk”.

