Lessons learned from the translation of documentation from LaTeX to HTML

Xavier Leroy (INRIA)

Introduction

This presentation reports on the transformation of a 200-page LaTeX technical document into HTML for inclusion in the World Wide Web. We believe that many others who have tried to provide information on the Web have encountered difficulties similar to ours; it is therefore worthwhile to summarize the lessons learned from our experience.

The initial document

The document we worked on is the reference manual and user's documentation for our implementation of the Caml Light programming language. It was originally written in LaTeX, with a good deal of non-standard environments and macros written directly in TeX. Parts of the document were generated: syntactic definitions (typeset from BNF descriptions) and descriptions of library functions (extracted from commented source code).

An HTML version of this document was desirable for several reasons:

Allow browsing of the documentation via the Web for those who wish to learn about the language, yet do not feel like printing the whole thing.
Support word searching in the documentation (via CGI-scripts), which allows experienced programmers to locate specific library functions or command options much faster and much more precisely than a paper index.
Serve as a starting point for providing on-line documentation for the microcomputer ports of Caml Light (through the Windows help system under MSDOS and through DocMaker on the Macintosh). Previous experiments with DocMaker showed that it is nearly impossible to go directly from the DVI file to a decent on-line document. The simplicity and popularity of HTML make it a good intermediate format.

The Caml Light documentation is about 200 pages long. It was therefore out of the question to rewrite it manually in HTML. Automatic translation was not obvious either, due to the typographical gap between TeX and HTML. For instance, even though the documentation does not contain math formulas properly speaking (no integrals, no Greek letters, ...), TeX's math mode is frequently used in BNF definitions and program samples to get metavariables, subscripts, superscripts, ... Typographical enrichments (italics, bold, monospaced fonts) are used extensively for disambiguation (e.g. to distinguish terminal symbols from BNF operators) and must be rendered faithfully. The documentation also contains many tables and a few line drawings, which caused some difficulties.

Preparing the LaTeX source

When faced with a LaTeX construct that has no direct HTML equivalent, the latex2html translator simply turns it into a bitmap image and inserts that image in the produced HTML code. This approach was not acceptable for our purposes:

Word searching ignores the text contained in the bitmaps.
The bitmaps translate poorly to the Macintosh or Windows online documentation systems.
The bitmaps are generally hard to read, since their resolution usually does not match that of the HTML viewer.
Transmission times are increased.

To avoid recourse to bitmaps and allow the production of decent HTML, we introduced a number of TeX macros in the LaTeX source, to ``abstract over'' the typesetting details and make the semantics of the LaTeX source more explicit. For instance, we wrote \var{x} instead of $x$ to denote the metavariable named x. Similarly, the n-th element of a sequence v was typed \nth{v}{n} instead of $v_n$. The same technique was also used to eliminate low-level constructs and environments such as center and tabular.

For typesetting with TeX, the new constructs (\var, \nth, etc.) were simply macro-defined as their old form, so the end result was the same. For translation to HTML, however, they were specially recognized and the intended meaning was rendered in the best possible HTML way: for instance, \nth{v}{n} becomes <i>v(n)</i>.
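As an illustration of this kind of special recognition, here is a minimal sketch in Python (not the Caml Light translator described in this report). It handles only \var and \nth, with arguments parsed as balanced brace groups. The rendering of \nth{v}{n} as <i>v(n)</i> is taken from the text above; the rendering of \var{x} as <i>x</i> is an assumption, and the function names are hypothetical.

```python
import html


def parse_group(src, i):
    """Parse one brace-delimited argument; src[i] must be '{'.

    Returns (translated_text, index_just_after_the_closing_brace).
    """
    if i >= len(src) or src[i] != "{":
        raise ValueError("expected '{' at position %d" % i)
    text, i = translate(src, i + 1, stop="}")
    if i >= len(src):
        raise ValueError("missing closing brace")
    return text, i + 1  # skip the closing brace


def translate(src, i=0, stop=None):
    """Translate src from index i until the `stop` character or end of input.

    Returns (html_text, index_where_translation_stopped).
    """
    out = []
    while i < len(src) and src[i] != stop:
        if src.startswith("\\var{", i):
            name, i = parse_group(src, i + len("\\var"))
            # Assumed rendering: a metavariable is set in italics.
            out.append("<i>" + name + "</i>")
        elif src.startswith("\\nth{", i):
            seq, i = parse_group(src, i + len("\\nth"))
            idx, i = parse_group(src, i)
            # Rendering given in the text: v(n) in italics.
            out.append("<i>" + seq + "(" + idx + ")</i>")
        else:
            # Ordinary character: escape <, > and & for HTML.
            out.append(html.escape(src[i], quote=False))
            i += 1
    return "".join(out), i


print(translate("the \\nth{v}{n} element of \\var{v}")[0])
```

Because the arguments are parsed recursively, nested uses such as \nth{\var{v}}{n} are handled without any special case.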

The overall approach is basically that of LaTeX for structuring TeX documents (instead of typesetting a section header by hand, just write \section{...} to communicate your intentions to the typesetting program), but extended to lower-level elements of the text: metavariables, sequence elements, entries in precedence tables, ...

For the parts of the documentation automatically generated, we modified the programs that generated them in LaTeX to either output HTML directly (embedded in a \begin{rawhtml} ... \end{rawhtml} environment) or output simplified LaTeX that would look bad on paper but translates nicely to HTML.

Finally, a few nasty uses of math mode (arrays of formulas) had to be translated manually to HTML and embedded in the source documentation as \begin{rawhtml} ... \end{rawhtml} blocks, immediately following the original TeX code bracketed by \begin{latexonly} ... \end{latexonly}. The presence of the two versions (LaTeX and HTML) side by side in the source helps keep them consistent during modifications. About 0.2% of the source had to be manually translated this way; the remaining 99.8% is automatically processed.
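The handling of these two environments on the HTML side can be sketched as follows. This is a Python illustration under assumptions, not the code of the actual translator: rawhtml contents are passed through verbatim, latexonly contents are dropped, and everything else is returned as LaTeX still to be translated. The function name and the ("latex" / "html") tags are hypothetical.

```python
RAW_BEGIN, RAW_END = "\\begin{rawhtml}", "\\end{rawhtml}"
TEX_BEGIN, TEX_END = "\\begin{latexonly}", "\\end{latexonly}"


def split_segments(src):
    """Split src into ("latex", text) and ("html", text) segments.

    rawhtml bodies become "html" segments, latexonly bodies are
    discarded, and all remaining text becomes "latex" segments.
    """
    segments = []
    i = 0
    while i < len(src):
        raw = src.find(RAW_BEGIN, i)
        tex = src.find(TEX_BEGIN, i)
        starts = [p for p in (raw, tex) if p != -1]
        if not starts:
            segments.append(("latex", src[i:]))
            break
        start = min(starts)
        if start > i:
            segments.append(("latex", src[i:start]))
        if start == raw:
            begin, end, keep = RAW_BEGIN, RAW_END, True
        else:
            begin, end, keep = TEX_BEGIN, TEX_END, False
        close = src.find(end, start)
        if close == -1:
            raise ValueError("unterminated " + begin)
        if keep:
            # Hand-written HTML is copied through untouched.
            segments.append(("html", src[start + len(begin):close]))
        i = close + len(end)
    return segments
```

A translator built this way would run its LaTeX-to-HTML pass only on the "latex" segments and concatenate the results with the "html" segments in order.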

Generating HTML

We originally planned to use Nikos Drakos' latex2html translator to produce the final HTML pages. This plan backfired for two reasons:

The definitions of many non-standard commands and environments we use can be communicated to latex2html using simple LaTeX definitions \newcommand and \newenvironment, but a few others (most notably the alltt environment) cannot be defined this way and require modification of latex2html itself. The innards of latex2html are undocumented, scarcely commented, and the Perl code is far from self-explanatory.
The memory requirements of latex2html appear to be superlinear in the size of the input document. When we ran it on the full manual, the process died after having exhausted all the available virtual memory space (about 200M). We then tried to run latex2html independently on each chapter; it still used uncomfortably large amounts of memory (around 100M) on the biggest chapters; moreover, gluing together the HTML pages obtained for each chapter (inter-chapter links, main table of contents) is difficult to do automatically and painful to do by hand.

We therefore had to write our own translator from the extended subset of LaTeX that we use to HTML. The translator consists of two main parts:

The translator proper reads the whole LaTeX document and outputs a single huge HTML document. It is written in Caml Light, making heavy use of the lexical analyzer generator camllex (the Caml equivalent to lex for C). Since we did not use Perl but a modern, type-safe high-level programming language with good memory management, our translator has negligible memory requirements, runs quickly, is easy to extend, and took little time to develop.
The output of the translator is then split into smaller pages (at <H1> and <H2> headers, basically). Finally, the pages are ``threaded'' together by inserting ``Next'' and ``Previous'' buttons at the top and bottom of the pages. Two simple Perl scripts do the job adequately.
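The splitting and threading steps can be sketched as below. The original scripts are in Perl and are not reproduced here; this is a Python illustration only. The page file names (page0.html, page1.html, ...) and the plain-link form of the ``Next'' and ``Previous'' buttons are assumptions made for the example.

```python
import re

# Pages start at every <H1> or <H2> header.
HEADER = re.compile(r"<H[12]>", re.IGNORECASE)


def split_pages(html_text):
    """Cut one large HTML document into pages at <H1> and <H2> headers."""
    starts = [m.start() for m in HEADER.finditer(html_text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first header
    ends = starts[1:] + [len(html_text)]
    return [html_text[s:e] for s, e in zip(starts, ends)]


def thread_pages(pages, name="page{}.html"):
    """Return {file_name: page_text} with navigation links added
    at the top and bottom of every page."""
    threaded = {}
    for n, body in enumerate(pages):
        links = []
        if n > 0:
            links.append('<A HREF="{}">Previous</A>'.format(name.format(n - 1)))
        if n + 1 < len(pages):
            links.append('<A HREF="{}">Next</A>'.format(name.format(n + 1)))
        nav = " ".join(links)
        threaded[name.format(n)] = nav + "\n" + body + "\n" + nav + "\n"
    return threaded
```

Since both steps work on the already generated HTML, they stay independent of the LaTeX translator proper.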

You may judge the final result and compare it with the DVI file obtained from the original.

The sources of the translator are available here. Some background about the Caml language is helpful to understand them.

Lessons learned

From this experience, we draw the following conclusions:

The HTML language

Despite its apparent simplicity, the HTML language is almost rich enough to format TeX-intensive technical documentation. The sole features that we badly missed were tables, subscripts, and superscripts. None of these is particularly hard to support in HTML viewers, and we hope they will be in the next ``official'' HTML specification. We had no real need for more advanced features such as a full-fledged formula mode.

HTML viewers

We were slightly disappointed by the quality of the typesetting performed by popular HTML viewers such as Mosaic and Netscape. For instance, horizontal spacing is often incorrect when italic characters appear next to roman characters (example: _} -- that's underscore-italic brace). Also, preformatted text is often set in a smaller font, but italics tags embedded in the preformatted text are rendered in normal size (Mosaic), or in smaller size but in monospaced italics rather than regular italics (Netscape). Both approaches make it difficult to ensure font consistency in the document.

HTML generators and transformers

The quality of publicly available programs that manipulate HTML is clearly insufficient:

They are hard to find due to the lack of well-known repositories. A few catalogues of tools are available on the Web, but they are incomplete and often outdated.
Many are quick hacks that have been put together in a hurry and made public after very little testing. ``It works for me; have fun!''.
Others (such as latex2html) are more mature and better supported, but still not robust enough. In particular, latex2html and a few others bomb on large inputs.

The immaturity of the field probably explains many of these weaknesses. But we also believe that the widespread use of Perl to program these tools is partly responsible for the situation. Perl is a fine generalization of the Unix line-oriented tools, and it works great for programs of the kind ``read a line, massage it, print it''. Many CGI scripts fall into this category, but not the more advanced tools we need to produce HTML from other high-level sources. Perl is inherently not suited to the parsing and transformation of structured languages such as LaTeX and HTML. Reading the whole input into a string and rewriting it over and over with s/.../.../eg just does not cut it in terms of clarity and performance; it's like playing the Well-Tempered Clavier on a Jew's harp. Languages with high-level parsing capabilities, real data structures and clean semantics are clearly needed here.

In which source language should we write documentation?

What is a good formatting language for writing texts that should be available both on paper and on the Web? An idealistic approach would be to design a new text formatting language that would replace both LaTeX and HTML, or at least be straightforwardly translatable to both. A practical approach is to acknowledge that LaTeX is a de facto standard in computer science and that HTML is already hard to modify, which suggests that more effort should be invested in more clever translators from LaTeX to HTML (e.g. capable of translating $x+1$ to <i>x</i>+1).
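To make the $x+1$ example concrete, here is a small Python sketch of what such a translation could look like for the simplest formulas only. It is an illustration under assumptions, not an existing tool: letters are set in italics, digits and a few operators are copied, and any formula containing something else (a backslash command, a subscript, braces) is left untouched as $...$ so that another mechanism can deal with it. The function names and the set of accepted symbols are hypothetical, and escaped dollar signs (\$) are not handled.

```python
import html

SIMPLE_SYMBOLS = set("+-=()[],.:;<>/")
DIGITS = set("0123456789")


def simple_math_to_html(math):
    """Translate a very simple formula, or return None if it is not simple."""
    out = []
    i = 0
    while i < len(math):
        c = math[i]
        if c == " ":
            i += 1  # TeX ignores spaces in math mode
        elif c.isascii() and c.isalpha():
            j = i
            while j < len(math) and math[j].isascii() and math[j].isalpha():
                j += 1
            out.append("<i>" + math[i:j] + "</i>")  # variables in italics
            i = j
        elif c in DIGITS or c in SIMPLE_SYMBOLS:
            out.append(html.escape(c, quote=False))
            i += 1
        else:
            return None  # too complex for this sketch
    return "".join(out)


def translate_inline_math(text):
    """Replace each simple $...$ formula in text by HTML."""
    parts = text.split("$")
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced $ in input")
    for k in range(1, len(parts), 2):  # odd-numbered parts are formulas
        converted = simple_math_to_html(parts[k])
        parts[k] = converted if converted is not None else "$" + parts[k] + "$"
    return "".join(parts)


print(translate_inline_math("so $x+1$ holds"))
```

Formulas that are rejected here, such as $v_n$, are exactly those for which the \nth-style macros described earlier were introduced.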

Xavier Leroy, INRIA (Xavier.Leroy@inria.fr)