Lessons learned from the translation of documentation
from LaTeX to HTML
Xavier Leroy (INRIA)
Introduction
This presentation reports on the transformation of 200 pages of LaTeX
technical documentation into HTML for publication on the World Wide Web.
We believe that many others who have tried to provide information on
the Web have encountered difficulties similar to ours; it is therefore
worthwhile to summarize the lessons gained from our experience.
The initial document
The document we worked on is the reference manual and user's
documentation for our implementation of the Caml Light programming
language. It was originally written in LaTeX, with a good deal of
non-standard environments and macros written directly in TeX. Parts of
the document were generated: syntactic definitions (typeset from BNF
descriptions) and descriptions of library functions (extracted from
commented source code).
An HTML version of this document was desirable for several reasons:
- Allow browsing of the documentation via the Web for those who
wish to learn about the language, yet do not feel like printing the
whole thing.
- Support word searching in the documentation (via CGI scripts),
which allows experienced programmers to locate specific library
functions or command options much faster and much more precisely than
a paper index.
- Serve as a starting point for providing on-line documentation for
the microcomputer ports of Caml Light (through the Windows help system
under MSDOS and through DocMaker on the Macintosh). Previous
experiments with DocMaker showed that it is nearly impossible to go
directly from the DVI file to a decent on-line document. The
simplicity and popularity of HTML make it a good intermediate format.
The Caml Light documentation is about 200 pages long. It was therefore
out of the question to rewrite it manually in HTML. Automatic
translation was not obvious either, due to the typographical gap
between TeX and HTML. For instance, even though the documentation does
not contain math formulas properly speaking (no integrals, no Greek
letters, ...), TeX's math mode is frequently used in BNF definitions
and program samples to get metavariables, subscripts, superscripts,
... Typographical enrichments (italics, bold, monospaced fonts)
are used extensively for disambiguation (e.g. distinguish terminal
symbols from BNF operators) and must be rendered faithfully. The
documentation also contains many tables and a few line drawings, which
caused some difficulties.
Preparing the LaTeX source
When faced with a LaTeX construct that has no direct HTML equivalent,
the latex2html translator simply turns it into a bitmap
image and inserts it in the generated HTML code. This approach was not
acceptable for our purposes:
- Word searching ignores the text contained in the bitmaps.
- The bitmaps translate poorly to the Macintosh or Windows online
documentation systems.
- The bitmaps are generally hard to read, since their resolution
usually does not match that of the HTML viewer.
- Transmission times are increased.
To avoid the recourse to bitmaps and allow the production of decent
HTML, we introduced a number of TeX macros in the LaTeX source, to
``abstract over'' the typesetting details and make the semantics of
the LaTeX source more explicit. For instance, we wrote \var{x} instead
of $x$ to denote the metavariable named x. Similarly, the n-th element
of a sequence v was typed \nth{v}{n} instead of $v_n$. The same
technique was also used to eliminate low-level constructs and
environments such as center and tabular.
For typesetting with TeX, the new constructs (\var, \nth, etc.) were
simply macro-defined as their old form, so the end result was the
same. For translation to HTML, however, they were specially recognized
and the intended meaning was rendered in the best possible HTML way:
for instance, \nth{v}{n} becomes <i>v(n)</i>.
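The HTML side of this scheme can be sketched as follows. This is an
illustrative Python fragment, not the Caml Light translator described
below. The names translate, read_group and MACROS are hypothetical, and
the HTML chosen for \var is an assumption; only the rendering of
\nth{v}{n} as <i>v(n)</i> comes from the text above.

```python
import html

# Hypothetical handler table: macro name -> (number of arguments, renderer).
# Only the \nth rendering is taken from the text; \var is an assumed choice.
MACROS = {
    "var": (1, lambda x: f"<i>{x}</i>"),
    "nth": (2, lambda v, n: f"<i>{v}({n})</i>"),
}


def read_group(src, pos):
    """Read one brace-delimited argument starting at src[pos] == '{'.

    Returns the raw argument text and the position just after the
    matching '}'. Nested braces are balanced.
    """
    if pos >= len(src) or src[pos] != "{":
        raise ValueError(f"expected '{{' at offset {pos}")
    depth = 0
    for i in range(pos, len(src)):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return src[pos + 1:i], i + 1
    raise ValueError("unbalanced braces")


def translate(src):
    """Translate a LaTeX fragment containing only text and known macros."""
    out = []
    pos = 0
    while pos < len(src):
        ch = src[pos]
        if ch == "\\":
            # Scan the macro name (letters only, as in TeX control words).
            end = pos + 1
            while end < len(src) and src[end].isalpha():
                end += 1
            name = src[pos + 1:end]
            if name not in MACROS:
                raise ValueError(f"unknown macro \\{name}")
            arity, render = MACROS[name]
            args = []
            pos = end
            for _ in range(arity):
                raw, pos = read_group(src, pos)
                args.append(translate(raw))  # arguments may contain macros
            out.append(render(*args))
        else:
            # Plain text: escape the characters that are special in HTML.
            out.append(html.escape(ch))
            pos += 1
    return "".join(out)
```

With these definitions, translate(r"\nth{v}{n}") returns <i>v(n)</i>.
Because each macro carries its meaning in its name, supporting a new
construct amounts to adding one entry to the table.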
The overall approach is basically that of LaTeX for structuring TeX
documents (instead of typesetting a section header by hand, just write
\section{...} to communicate your intentions to the typesetting
program), but extended to lower-level elements of the text:
metavariables, sequence elements, entries in precedence tables, ...
For the automatically generated parts of the documentation, we
modified the programs that generated them in LaTeX to either output
HTML directly (embedded in a \begin{rawhtml} ... \end{rawhtml}
environment) or output simplified LaTeX that would look bad on paper
but translates nicely to HTML.
Finally, a few nasty uses of math mode (arrays of formulas) had to be
translated manually to HTML and embedded in the source documentation
as \begin{rawhtml} ... \end{rawhtml} blocks, immediately following the
original TeX code bracketed by \begin{latexonly} ... \end{latexonly}.
The presence of the two versions (LaTeX and HTML) side by side in the
source helps keep them consistent during modifications. About 0.2% of
the source had to be manually translated this way; the remaining 99.8%
is automatically processed.
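How a translator might treat these two environments can be sketched as
follows. This is an illustrative Python fragment under simple
assumptions (no nesting of the two environments, delimiters spelled
exactly as shown); the function name split_environments and its
translate_text parameter are hypothetical.

```python
RAW_BEGIN, RAW_END = r"\begin{rawhtml}", r"\end{rawhtml}"
TEX_BEGIN, TEX_END = r"\begin{latexonly}", r"\end{latexonly}"


def split_environments(src, translate_text):
    """Copy rawhtml bodies verbatim, drop latexonly bodies, and pass
    all remaining text to the translate_text callback."""
    out = []
    pos = 0
    while pos < len(src):
        raw = src.find(RAW_BEGIN, pos)
        tex = src.find(TEX_BEGIN, pos)
        starts = [p for p in (raw, tex) if p != -1]
        if not starts:
            out.append(translate_text(src[pos:]))
            break
        start = min(starts)
        out.append(translate_text(src[pos:start]))
        if start == raw:
            begin, end = RAW_BEGIN, RAW_END
        else:
            begin, end = TEX_BEGIN, TEX_END
        body_start = start + len(begin)
        body_end = src.find(end, body_start)
        if body_end == -1:
            raise ValueError(f"missing {end}")
        if begin == RAW_BEGIN:
            out.append(src[body_start:body_end])  # HTML copied verbatim
        # latexonly bodies are dropped: they exist only for the paper version
        pos = body_end + len(end)
    return "".join(out)
```

The TeX-side counterpart is symmetric: there the rawhtml body is
discarded and the latexonly body is typeset normally.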
Generating HTML
We originally planned to use Nikos Drakos' latex2html translator to
produce the final HTML pages. This plan backfired for two reasons:
- The definitions of many non-standard commands and environments we
use can be communicated to latex2html using simple LaTeX definitions
(\newcommand and \newenvironment), but a few others (most notably the
alltt environment) cannot be defined this way and require modification
of latex2html itself. The innards of latex2html are undocumented and
scarcely commented, and the Perl code is far from self-explanatory.
- The memory requirements of latex2html appear to be
superlinear in the size of the input document. When we ran it on the
full manual, the process died after having exhausted all the available
virtual memory space (about 200M). We then tried to run
latex2html independently on each chapter; it still used
uncomfortably large amounts of memory (around 100M) on the biggest
chapters; moreover, gluing together the HTML pages obtained for each
chapter (inter-chapter links, main table of contents) is difficult to
do automatically and painful to do by hand.
We therefore had to write our own translator to HTML from the extended
subset of LaTeX that we use. The translator consists of two main parts:
- The translator proper reads the whole LaTeX document and outputs
a single huge HTML document. It is written in Caml Light and makes
heavy use of the lexical analyzer generator camllex (the Caml
equivalent of lex for C). Since we did not use Perl but a
modern, type-safe high-level programming language with good memory
management, our translator has negligible memory requirements, runs
quickly, is easy to extend, and took little time to develop.
- The output of the translator is then split into smaller pages (at
<H1> and <H2> headers, basically). Finally, the pages are ``threaded''
together by inserting ``Next'' and ``Previous'' buttons at the top and
bottom of the pages. Two simple Perl scripts do the job adequately.
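The splitting and threading step can be sketched as follows. This is
an illustrative Python fragment, not the Perl scripts mentioned above.
The function names, the nodeN.html file naming, and the plain-link
rendering of the ``Next'' and ``Previous'' buttons are all assumptions
made for the example.

```python
import re

# Split points: the start of every <H1> or <H2> opening tag, in upper or
# lower case, possibly followed by attributes.
HEADER = re.compile(r"<h[12][\s>]", re.IGNORECASE)


def split_pages(document):
    """Split one HTML document into pages starting at each H1/H2 header."""
    starts = [m.start() for m in HEADER.finditer(document)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text before the first header
    bounds = starts + [len(document)]
    return [document[a:b] for a, b in zip(bounds, bounds[1:])]


def thread_pages(pages, name=lambda i: f"node{i}.html"):
    """Add Previous/Next links at the top and bottom of every page."""
    threaded = []
    for i, body in enumerate(pages):
        links = []
        if i > 0:
            links.append(f'<a href="{name(i - 1)}">Previous</a>')
        if i < len(pages) - 1:
            links.append(f'<a href="{name(i + 1)}">Next</a>')
        nav = "<p>" + " ".join(links) + "</p>\n"
        threaded.append(nav + body + nav)
    return threaded
```

A complete tool would also have to rewrite cross-references so that
links pointing into another page carry that page's file name; the
sketch leaves this out.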
You may judge the final result and compare it with the DVI file
obtained from the original. The sources of the translator are
available here. Some background about the Caml language is helpful to
understand them.
Lessons learned
From this experience, we draw the following conclusions:
The HTML language
Despite its apparent simplicity, the HTML language is almost rich
enough to format TeX-intensive technical documentation. The sole
features that we badly missed were tables, subscripts, and
superscripts. None of these is particularly hard to support in HTML
viewers, and we hope they will be in the next ``official'' HTML
specification. We had no real need for more advanced features such as
a full-fledged formula mode.
HTML viewers
We were slightly disappointed by the quality of the typesetting
performed by popular HTML viewers such as Mosaic and Netscape.
For instance, horizontal spacing is often incorrect when italic
characters appear next to roman characters (example: _} --
that's underscore-italic brace).
Also, preformatted text is often set up in a smaller font, but italics
tags embedded in the preformatted text are in normal size (Mosaic), or
in smaller size but in monospaced italics rather than regular italics
(Netscape). Both approaches make it difficult to ensure font
consistency in the document.
HTML generators and transformers
The quality of publicly available programs that manipulate HTML is
clearly insufficient:
- They are hard to find due to the lack of well-known
repositories. A few catalogues of tools are available on the Web, but
they are incomplete and often outdated.
- Many are quick hacks that have been put together in a hurry and
made public after very little testing (``It works for me; have fun!'').
- Others (such as latex2html) are more mature and
better supported, but still not robust enough. In particular,
latex2html and a few others bomb on large inputs.
The immaturity of the field probably explains many of these
weaknesses. But we also believe that the widespread use of Perl to
program these tools is partly responsible for the situation.
Perl is a fine generalization of the Unix line-oriented tools, and it
works great for programs of the kind ``read a line, massage it, print
it''. Many CGI scripts fall in this category, but not the more
advanced tools we need to produce HTML from other high-level sources.
Perl is inherently not suited to the parsing and transformation of
structured languages such as LaTeX and HTML. Reading the whole input
into a string and rewriting it over and over with s/.../.../eg
just does not cut it in terms of clarity and performance; it's like
playing the Well-Tempered Clavier on a Jew's harp.
Languages with high-level parsing capabilities, real data structures
and clean semantics are clearly needed here.
In which source language should we write documentation?
What is a good formatting language for writing texts that should be
available both on paper and on the Web? An idealistic approach would
be to design a new text formatting language that would replace both
LaTeX and HTML, or at least be straightforwardly translatable to both.
A practical approach is to acknowledge that LaTeX is a de facto
standard in computer science and that HTML is already hard to modify,
which suggests that more effort should be invested in cleverer
translators from LaTeX to HTML (e.g. capable of translating $x+1$
to <i>x</i>+1).
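The simplest case of such a translation can be sketched as follows.
This is an illustrative Python fragment that handles only inline
formulas made of letters, digits and operators: no TeX macros, no
subscripts or superscripts, and no escaped dollar signs. The function
names are hypothetical, and text outside the formulas is passed through
unchanged.

```python
import html
import re

# Inline math delimited by single dollar signs, with no nested '$'.
INLINE_MATH = re.compile(r"\$([^$]*)\$")
# Maximal runs of ASCII letters, which math mode sets in italics.
LETTERS = re.compile(r"[A-Za-z]+")


def math_to_html(formula):
    """Italicize the letters of a simple formula; escape everything else."""
    out = []
    pos = 0
    for m in LETTERS.finditer(formula):
        out.append(html.escape(formula[pos:m.start()]))
        out.append(f"<i>{m.group()}</i>")
        pos = m.end()
    out.append(html.escape(formula[pos:]))
    return "".join(out)


def translate_inline_math(text):
    """Replace every $...$ span in text by its HTML rendering."""
    return INLINE_MATH.sub(lambda m: math_to_html(m.group(1)), text)
```

Here math_to_html("x+1") returns <i>x</i>+1. Anything beyond this
simple case calls for a real parser of math mode rather than pattern
matching.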
Xavier Leroy, INRIA (Xavier.Leroy@inria.fr)