\documentclass{article}

\def\reason#1{{\footnotesize{\bf Reason:} #1}}

\begin{document}
\title{The Next Generation {\tt latex2html}---Why and~What}
\author{Marcus E. Hennecke\\{\tt hennecke@dbag.ulm.daimlerbenz.com}}
\maketitle

\begin{abstract}
As both \LaTeX\ and the Web progress, so \verb|latex2html| needs to
evolve and change according to the needs of its users. However,
certain design decisions that were made while writing the code now
have a negative effect on the further development of the package. In
order to overcome these difficulties, this paper proposes a next
generation \verb|latex2html|, which reimplements a lot of the
functionality of the old \verb|latex2html|, but in very different
ways.
\end{abstract}

\section{Introduction}
In the spring of 1996, Michael Downes \verb|<mjd@math.ams.org>| and I
(at the time a doctoral student at Stanford) discussed some of the
weaker points of \verb|latex2html| and how some of them stem directly
from the underlying architecture of the converter. It soon became
apparent that in order to overcome these difficulties, many parts of
\verb|latex2html| needed to be rewritten and the way it did the
conversion needed to be changed. As a result of these discussions, I
wrote \verb|latex2html-ng|, based on \verb|latex2html| 96.1.

This paper reports on the development of \verb|latex2html-ng|,
explains why the changes were made, and lists them, first in general
terms and then in more detail.

\section{The Main Goals}
There were three important driving forces behind the development of
the next generation \verb|latex2html|:
\begin{enumerate}
\item Make \verb|latex2html| behave more like \LaTeX.
\item Make \verb|latex2html| more portable across platforms.
\item Make \verb|latex2html| more efficient.
\end{enumerate}
The goals are listed in order of importance. Indeed, the most
radical changes are due to the first goal, matching the behaviours of
\verb|latex2html| and \LaTeX\ more closely. This includes such
fundamental features as real scoping and goes down to more subtle
features such as being able to specify a closing bracket in an
optional argument as in:
\begin{verbatim}
\section[The {]} bracket]{A section about brackets}
\end{verbatim}
One of the most important steps towards this end was the elimination
of both \verb|texexpand| and \verb|pre_process|. The script now works
on the raw \LaTeX\ code instead.

A further goal was to make \verb|latex2html| more portable. The main
idea behind this is that in principle, the only thing non-portable in
the translation process is the image generation. In the ideal case,
the \verb|latex2html| script itself should work on all platforms
unchanged. Only the image generation routines need porting and should
be provided in a separate module.

Efficiency was not as important a goal as the other two, but since
some members of the mailing list had complained about the memory
usage of \verb|latex2html|, some changes were made that improve
memory efficiency. Memory is saved mainly by not expanding the whole
document before translation (i.e., \verb|\input| and \verb|\include|
are processed one at a time and only when they are called) and by not
copying the entire code each time a command procedure is called.

\section{General Description of the Changes}
All of the changes to \verb|latex2html| were made based on version
96.1, rev d. In addition, many of the changes have altered the way
things work internally. This has a number of disadvantages:
\begin{itemize}
\item The way \verb|do_cmd_| and \verb|do_env_| subroutines are called
  and the way arguments are retrieved has been changed. This means
  that old \verb|.perl| files will no longer work and need to be
  rewritten.

\item Requires Perl 5.001 or newer due to some bugs triggered in both
  Perl 4.* and Perl 5.000. This move also makes it possible to use
  some of the more advanced features of Perl 5 such as real local
  variables (\verb|my(...)|), lists of lists, regular expressions that
  really match only at the beginning or end of the string, etc.

\item Many new bugs have probably been introduced and some features
  broken. Extensive testing is required, so both versions should
  probably be provided in parallel for some time.

\item Many of the features in 97.1 are not yet implemented.
\end{itemize}

However, in general, the advantages of \verb|latex2html-ng| should
greatly outweigh the disadvantages.

Most of the changes make \verb|latex2html| behave more like
\TeX/\LaTeX. This in turn means that a lot of things now
work correctly and that things that used to require awkward
workarounds are now straightforward. Also, the HTML generated is much
closer to the standard and more likely to validate.

Generally, \verb|latex2html| now works directly on the raw \LaTeX\
code. This means no more preprocessing of any kind. \verb|texexpand|
is no longer needed and since the file is no longer split, no DBM
routines are required (which caused problems for some). The code is
processed in sequential order, instead of innermost environment
first. Definitions and newcommands are processed when they occur so
that real scoping is possible (i.e., a definition is local to its
scope). Mixing \verb|verbatim| environments with \verb|\input|
commands is trivial. Since no preprocessing occurs, the original code
is always available for image generation.

\section{List of Changes}
Here is a more detailed list of changes (this covers most of the more
important ones):
\begin{itemize}
\item No more preprocessing and no more use of \verb|texexpand|. All
  translation is done on the raw \LaTeX\ code. While expanding and
  preprocessing the code has some benefits for further processing, it
  is a highly complex task due to the many quirks and exceptions
  possible in \LaTeX\ (e.g., \verb|\input| inside \verb|figure|, weird
  combinations of comments, verbatim environments, etc.).

  \reason{Improves efficiency by not expanding the code before
  translation and emulates \LaTeX\ behaviour more closely.}

\item Sequential processing of \LaTeX\ commands. This is probably one
  of the most profound changes to \verb|latex2html|. Previously,
  environments were translated first, starting with the innermost
  ones, and regular commands second. This is vastly different from
  the way \LaTeX\ works, which processes commands in sequential order
  as they appear in the source. Because of this difference, some
  simple \LaTeX\ constructs could not be translated into proper
  HTML. For example:
\begin{verbatim}
	{\small\begin{tabular}{...}...\end{tabular}}
\end{verbatim}
  Since the \verb|tabular| environment was processed first, the
  \verb|\small| command could not have an effect on the table.

  Now the code is processed in sequential order just like \LaTeX\ does
  it. Furthermore, there is no real distinction anymore between
  environments and commands.

  \reason{Matches behaviour more closely.}

\item Major change in the way \verb|do_cmd_| subroutines are
  used. Subroutines are called with two parameters: The first one
  contains all \LaTeX\ code following the command (as
  before). However, it is strongly recommended that programmers not
  use \verb|local($text) = @_;| or something similar to get at the
  text. Instead, this parameter should always be accessed by
  reference, i.e., through \verb|$_[0]|. There are two reasons:
  \begin{enumerate}
  \item \verb|local($text) = @_;| makes another copy of the
    (possibly long) text following the command, wasting both time and
    memory. In contrast, \verb|$_[0]| is an alias for the caller's
    variable, so accessing it directly requires no copying.
  \item If the subroutine needs to access an argument, the argument
     has to be removed from the following text. This, however, is not
     possible if only a copy of the text is available.
  \end{enumerate}
  The second parameter contains the exact piece of \LaTeX\ code that
  triggered the call to the subroutine. For example,
  \verb|do_cmd_bgroup| may be called both by `\verb|{|' as well as
  `\verb|\bgroup|'.

  The return value of \verb|do_cmd_| subroutines is also different. It
  used to be the ``value'' of the \LaTeX\ command expressed as HTML
  markup plus the entire following text. However, this meant yet
  another copy operation of the (possibly large) \LaTeX\
  code. Instead, the return value now only contains the HTML markup
  produced by the command itself, nothing more.
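
  The calling convention can be sketched as follows. This is a
  minimal illustration only: \verb|do_cmd_emph| and the simplified
  \verb|get_next_argument| shown here are hypothetical stand-ins, not
  code taken from the script. They show how \verb|$_[0]| lets a
  subroutine consume its argument from the caller's text while
  returning nothing but the HTML.
\begin{verbatim}
use strict;
use warnings;

# Hypothetical, simplified stand-in: removes a leading {...}
# argument (no nested braces) from the caller's string through
# the alias $_[0] and returns the argument's contents.
sub get_next_argument {
    return $_[0] =~ s/^\s*\{([^{}]*)\}// ? $1 : '';
}

# Illustrative handler: $_[0] is the text following the command,
# $_[1] the piece of code that triggered the call.
sub do_cmd_emph {
    my $arg = get_next_argument($_[0]);  # shrinks the caller's text
    return "<EM>$arg</EM>";              # HTML only, no remainder
}

my $text = '{important} words follow';
my $html = do_cmd_emph($text, '\emph');
print "$html|$text\n";
\end{verbatim}
  After the call, \verb|$text| no longer contains the argument, and
  no copy of the remaining text was made along the way.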

  \reason{The main reason for this was to improve efficiency.}

\item Arguments are now {\em always} retrieved by using
  \verb|get_next_argument| or \verb|get_next_optional_argument| which
  {\em must} be called with \verb|$_[0]| as parameter so that the
  argument can be removed from the following text. Example:
\begin{verbatim}
    my($arg,$pat) = &get_next_argument($_[0]);
\end{verbatim}
  These subroutines are now fully brace aware. That is, a closing
  bracket can appear inside braces within an optional argument:
\begin{verbatim}
	\section[The {]} bracket]{A section on closing brackets}
\end{verbatim}
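
  One way to obtain this kind of brace awareness is to track the
  brace depth while scanning for the closing bracket. The sketch
  below works under that assumption; it is not the actual
  \verb|get_next_optional_argument|, and the name
  \verb|sketch_optional_argument| is hypothetical. Escaped braces are
  not handled.
\begin{verbatim}
use strict;
use warnings;

# Hypothetical sketch: removes a leading [...] argument from the
# caller's string (aliased as $_[0]) and returns its contents.
# A ']' that sits inside braces does not end the argument.
sub sketch_optional_argument {
    return undef unless $_[0] =~ /^\s*\[/g;
    my $start = pos($_[0]);
    my $depth = 0;
    while ($_[0] =~ /([{}\]])/g) {
        if    ($1 eq '{') { $depth++ }
        elsif ($1 eq '}') { $depth-- }
        elsif ($depth == 0) {
            my $end = pos($_[0]);
            my $arg = substr($_[0], $start, $end - $start - 1);
            substr($_[0], 0, $end) = '';
            return $arg;
        }
    }
    return undef;    # no closing bracket found
}

my $text = '[The {]} bracket]{A section about brackets}';
my $opt  = sketch_optional_argument($text);
print "$opt\n$text\n";
\end{verbatim}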

  \reason{This change makes {\tt latex2html} behave more like \LaTeX.}

\item Environments are now processed just like any other commands. That
  is, there are now \verb|do_cmd_begin| and \verb|do_cmd_end|. This is
  how it works: When an environment opens, the \verb|process_command|
  subroutine sees the \verb|\begin| command. It then calls the
  corresponding \verb|do_cmd_begin| routine. This routine in turn
  reads the next argument, which gives it the environment name. It
  then opens a new scope, remembers the environment name (for error
  checking) and calls the corresponding \verb|do_env_begin_|
  routine. Similarly, at the end of the environment, \verb|do_cmd_end|
  is called which in turn calls \verb|do_env_end_| and closes the
  scope. Consequently, there now need to be two subroutines per
  environment, not just one, called \verb|do_env_begin_*| and
  \verb|do_env_end_*|.
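
  The dispatch described above might look roughly like the following
  sketch. The handlers for \verb|center|, the helper \verb|env_name|,
  and the array \verb|@open_envs| are hypothetical, and the opening
  and closing of scopes is left out to keep the example short.
\begin{verbatim}
use strict;
use warnings;

my @open_envs;    # names of the currently open environments

# Hypothetical handlers for one environment.
sub do_env_begin_center { return '<DIV ALIGN="CENTER">' }
sub do_env_end_center   { return '</DIV>' }

# Simplified stand-in for get_next_argument (no nested braces).
sub env_name { return $_[0] =~ s/^\{([^{}]*)\}// ? $1 : '' }

sub do_cmd_begin {
    my $env = env_name($_[0]);
    push @open_envs, $env;          # remembered for error checking
    my $handler = main->can("do_env_begin_$env");
    return $handler ? $handler->($_[0]) : '';
}

sub do_cmd_end {
    my $env  = env_name($_[0]);
    my $open = @open_envs ? pop @open_envs : '';
    die "mismatched end of $env (open: $open)\n"
        unless $open eq $env;
    my $handler = main->can("do_env_end_$env");
    return $handler ? $handler->($_[0]) : '';
}

my $text = '{center}Hello';
my $html = do_cmd_begin($text, '\begin');  # $text is now 'Hello'
$html .= $text;
$text = '{center}';
$html .= do_cmd_end($text, '\end');
print "$html\n";
\end{verbatim}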

  \reason{This goes a very long way towards matching behaviours.}

\item Real grouping and local declarations and definitions. If you define
  a macro, its scope is limited to the local group. This is done via
  \verb|do_cmd_bgroup| and \verb|do_cmd_egroup| and the use of an execute
  stack. What happens is that every time a declaration or definition
  is processed, a piece of Perl code is added to the execute stack that
  reverses the declaration or definition. At the end of the group, the
  code on the stack is executed so that the same declarations and
  definitions as before the group are in effect again.

  For example, a variable might contain a declaration such as the
  current alignment (e.g., \verb|ALIGN=CENTER|). When a group is
  entered, the execute stack is shifted and an empty element is added
  to it which becomes the current element. Suppose that inside the
  group the alignment is changed. Then the routine that made the
  change will also add some Perl code to the current element that
  reverses that change. When the end of the group is reached, the
  current element simply needs to be popped off the stack and executed
  and the value of the variable has returned to what it was before the
  group.
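
  The mechanism can be sketched as follows. Note the assumptions: the
  sketch stores closures rather than pieces of Perl code, uses
  \verb|push| and \verb|pop| on a plain array, and all names
  (\verb|@exec_stack|, \verb|set_align|, and so on) are hypothetical.
\begin{verbatim}
use strict;
use warnings;

my $align = 'LEFT';    # a declaration tracked across groups
my @exec_stack;        # one list of undo actions per open group

sub begin_group { push @exec_stack, [] }

sub set_align {
    my ($new) = @_;
    my $old = $align;
    # Record how to reverse this change in the current group.
    push @{ $exec_stack[-1] }, sub { $align = $old } if @exec_stack;
    $align = $new;
}

sub end_group {
    my $undo = pop @exec_stack or return;
    # Undo in reverse order so the value from before the group
    # is the one that ends up being restored.
    $_->() for reverse @$undo;
}

begin_group();
set_align('CENTER');
set_align('RIGHT');
my $inside = $align;   # value while the group is open
end_group();
print "$inside $align\n";
\end{verbatim}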

  \reason{This also goes a very long way towards matching behaviours.}

\item All grouping is now copied to the \verb|images.tex| file as well
  as all definitions and declarations. That is, every \verb|{|, every
  \verb|}|, every \verb|\bf|, every \verb|\def|, etc. is copied over
  to the \verb|images.tex| file so that definitions and declarations
  have the same effects on images as they do in the original \LaTeX\
  code.

  In addition, the preamble text is copied to \verb|images.tex|
  unmodified.

  \reason{The resulting images correspond better to what they should
  look like in the context of the whole document.}

\item Support for \verb|ALT| text in images (e.g. \LaTeX\ code for
  math equations, more useful footnote \verb|ALT| text).

  \reason{Better support for text-only browsers and users that have
  turned off image loading.}

\item Support for trial mode: Certain math equations can be translated
  without the use of images. In trial mode, translation is
  attempted. If successful, the translation is used. Otherwise, the
  equation is converted to an image. In particular, this solves the
  problem of the \verb|$\backslash$| command.

  \reason{Improves efficiency by not converting small \LaTeX\ snippets
  to images.}

\item No more use of DBM (one source of problems under Linux).

  \reason{May improve portability.}

\item Support for catcodes. Originally, this was introduced to implement
  the \verb|\makeatletter| and \verb|\makeatother| commands. However,
  it also makes support for the \verb|german| and \verb|alltt| styles
  trivial. You can also write:
\begin{verbatim}
	\catcode`\%=12 This: % is no longer a comment.
\end{verbatim}

  \reason{Matches behaviour of \LaTeX\ and {\tt latex2html}.}

\item Comments are now translated into HTML comments. For example,
\begin{verbatim}
	This: % is a comment
\end{verbatim}
  is translated to
\begin{verbatim}
	This: <!-- is a comment -->
\end{verbatim}
  Note that the way arguments are retrieved has a side effect on
  comment parsing:
\begin{verbatim}
	\mbox{\catcode`\%=12 This: % is still a comment.}
\end{verbatim}

  That is, even though the \verb|%| character is defined to be a
  regular character, it is still treated as a comment
  character. \LaTeX\ behaves the same way: because the catcode change
  happens inside an argument that has already been read, it also sees
  this as a comment. In other words, \verb|latex2html|'s comment parsing is
  now somewhat closer to that of \LaTeX.
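
  For the simple case shown first, the translation amounts to
  something like the sketch below. The subroutine
  \verb|comment_to_html| is hypothetical and ignores catcode changes
  entirely; it only illustrates the mapping from a \LaTeX\ comment to
  an HTML comment.
\begin{verbatim}
use strict;
use warnings;

# Hypothetical sketch: turns the first unescaped % on a line
# into an HTML comment. Catcode changes are ignored here.
sub comment_to_html {
    my ($line) = @_;
    $line =~ s/(?<!\\)%\s*(.*)$/<!-- $1 -->/;
    return $line;
}

my $plain   = comment_to_html('This: % is a comment');
my $escaped = comment_to_html('A 50\% share');
print "$plain\n$escaped\n";
\end{verbatim}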

  \reason{May make for better reading of the HTML code.}

\item Number of system calls reduced to a minimum. This is done for
  portability reasons. System calls generally do not port well to a
  Mac, PC, or to VMS. The only remaining non-portable calls should now
  reside in the image generation routines. The hope is that as
  development on \verb|latex2html| progresses, other platforms can use
  the updated versions unmodified.

  \reason{Improves portability.}

\item Font declarations now finally work right. The routines have been
  completely rewritten. \verb|Latex2html| now keeps track of the
  current font specification. Whenever something is changed (e.g.,
  size, boldness, italics, etc.), a subroutine is called, which looks
  at the current and the desired font specifications and determines
  the proper HTML code to do the switch. The same subroutine is called
  at the end of the scope to return to the old specification.

  \reason{Matches behaviours.}

\item The produced HTML code is closer to standard HTML. For example,
  font commands no longer extend beyond the end of the paragraph (the
  following is illegal: ``\verb|<B>Par 1<P>Par 2</B>|'').

  \reason{Improves validity of generated HTML code.}

\item Full support for \TeX-style \verb|\def|, including parameter
  patterns. For example:
\begin{verbatim}
	\def\foo(#1#2){1st par: '#1', 2nd par: '#2'}
	Do you know what will happen here: \foo(145)?
	How about here: \foo({14}5)?
\end{verbatim}
  Since \verb|\def| is now better supported, \verb|texdefs.perl| has
  been merged with the main \verb|latex2html| script.

  \reason{Matches behaviours.}

\item Better at converting accents. If an accented character is not
  available but both the accent and the character are, a two-character
  workaround is used. For example, \verb|\=A| (i.e., \=A) would be
  converted to \verb|&Amacr;| if that entity were available. Since it
  isn't (at least in HTML 2.0), but both the accent and the character
  are, it is converted to \verb|&macr;A| (i.e., \={}A) instead.
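
  The fallback logic could be sketched like this. The two tables are
  small illustrative subsets chosen for the example, and
  \verb|convert_accent| is a hypothetical name; the real tables and
  routines are not reproduced here.
\begin{verbatim}
use strict;
use warnings;

# Illustrative subsets only, not the script's real tables.
my %accent_entity = ('=' => 'macr', "'" => 'acute');
my %known_entity  = map { $_ => 1 } qw(Aacute eacute macr acute);

# Returns HTML for an accent command applied to one character,
# or undef if an image would be needed.
sub convert_accent {
    my ($accent, $char) = @_;
    my $name = $accent_entity{$accent} or return undef;
    return "&$char$name;" if $known_entity{"$char$name"};
    return "&$name;$char" if $known_entity{$name};
    return undef;
}

my $acute  = convert_accent("'", 'e');   # combined entity exists
my $macron = convert_accent('=', 'A');   # falls back to two parts
print "$acute $macron\n";
\end{verbatim}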

  \reason{Increases efficiency by not necessarily converting unknown
  accents to images.}

\item Lots more comments in the Perl code.

  \reason{This was really necessary if others wanted to participate in
  the development.}

\item Much better support for counters (now done in HTML). Two new
  hashes, called \verb|%counter_values| and \verb|%counter_within|
  are used to manage the counters in \verb|latex2html|. The first hash
  contains the counter values. The second hash takes care of counter
  dependencies. Any counter definitions and changes to the counters
  are also logged to the \verb|images.tex| file so they can affect the
  images as well.
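
  As a rough sketch of how two such hashes can cooperate: the hash
  names below follow the text, but their exact layout is an
  assumption, and \verb|new_counter| and \verb|step_counter| are
  hypothetical helpers.
\begin{verbatim}
use strict;
use warnings;

# Hash names follow the text; their layout is an assumption.
my %counter_values;   # counter name => current value
my %counter_within;   # counter name => counter that resets it

sub new_counter {
    my ($name, $within) = @_;
    $counter_values{$name} = 0;
    $counter_within{$name} = $within if defined $within;
}

sub step_counter {
    my ($name) = @_;
    $counter_values{$name}++;
    # Reset every counter that was defined within $name.
    for my $dep (keys %counter_within) {
        $counter_values{$dep} = 0
            if $counter_within{$dep} eq $name;
    }
}

new_counter('section');
new_counter('subsection', 'section');
step_counter('section');
step_counter('subsection');
step_counter('subsection');
step_counter('section');     # resets subsection
print "$counter_values{section}.$counter_values{subsection}\n";
\end{verbatim}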

  \reason{Matches behaviours.}

\item Much better support for the \verb|list| environment
  (\verb|\usecounter| supported). Since counters are now supported, it
  was also possible to support \verb|list| more fully. Example:
\begin{verbatim}
	\begin{list}{B--\Roman{bean}}{\usecounter{bean}
	\setlength{\rightmargin}{\leftmargin}}
	\item This is the first item of the list.
	        Observe how the left and ...
	\item This is the second item.
	\end{list}
\end{verbatim}
  Of course the margin commands are ignored, but the list items are
  properly numbered with a leading `B--' followed by the corresponding
  uppercase roman numeral.

  \reason{Matches behaviours.}

\item Behaviour of \verb|\label| closer to the \LaTeX\ behaviour:
  Depending on whether \verb|\ref| or \verb|\pageref| is used, it can
  now point to either the current section title or the exact location
  of the \verb|\label| command. This is done similarly to the way
  \LaTeX\ works by keeping track of a current reference name and
  reference value. These are set by section and counter commands as
  well as certain environments (e.g. \verb|figure|, \verb|equation|).

  \reason{Matches behaviours.}

\item Image generation now makes use of information extracted from the
  log file. That is, during writing of the \verb|images.tex| file, a
  bunch of \verb|\lthtmltypeout| commands are interspersed that write
  information to the log file such as the current page numbers, the
  names and sizes of the boxes containing the image generating code,
  current counter values, and image parameters from the
  \verb|\htmlimg| command. This makes the process more robust to
  images that extend over more than one page and has other advantages
  (e.g. in case the image generation routines on some other platform
  are not written in Perl, all the necessary information can be
  extracted from the log file).

  \reason{Increases robustness of image generation.}

\item Better support for links to sections and images (i.e., if a
  \verb|\label| is part of an image, the image is embedded inside an
  anchor instead of adding an ``empty'' anchor before the image).

\item Different platforms use different directory delimiters in their
  pathnames. Similarly, in order to separate paths in path lists,
  different path delimiters must be used. For example, under Unix, the
  directory delimiter is `/' and the path delimiter is `:', whereas on
  a Mac the directory delimiter is `:' and the path delimiter is
  `,'. In order to improve portability, two new variables are
  introduced, called \verb|$dd| and \verb|$pd|. They should always be
  used when files are referenced by their full path name.
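
  A short sketch of how these variables might be used: the values are
  the Unix ones given above, while \verb|candidate_paths| and the file
  name \verb|example.perl| are hypothetical.
\begin{verbatim}
use strict;
use warnings;

# Unix values from the text; a Mac port would use ':' and ','.
my ($dd, $pd) = ('/', ':');

# Hypothetical helper: for each directory in a path list, build
# the full file name that would be tried.
sub candidate_paths {
    my ($pathlist, $file) = @_;
    return map { "$_$dd$file" } split /\Q$pd\E/, $pathlist;
}

my @tries = candidate_paths('/usr/lib/l2h:styles', 'example.perl');
print join("\n", @tries), "\n";
\end{verbatim}
  Because no delimiter is hard-coded in the helper, only the two
  assignments at the top would need to change on another platform.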

  \reason{Increases portability.}

\item The elements of the \verb|%section_info| and
  \verb|%toc_section_info| hashes are now proper lists, making
  \verb|post_process| more efficient.

  \reason{Increases efficiency.}
\end{itemize}

\section{Future work and conclusions}
The original goal for \verb|latex2html-ng| was to wait for 96.2 to
come out and then make it 96.3. However, development of 96.2 took much
longer than expected and it also took on many more features than {\em
I} had expected (I had thought it would be more of a bug-fix
release). Unfortunately, this has several negative effects on
\verb|latex2html-ng|. First of all, since 96.2 now has many features
that {\tt -ng} does not, it is somewhat behind. Instead of becoming
version 96.3 or 97.2, it will have to be a parallel release until it
has caught up with 97.1 (i.e., is as stable and has at least the same
features as the `regular' \verb|latex2html|). Second, since I am no
longer a doctoral student, I have very little time to spare to
work on it and hope that others will be able to continue my work or at
least make use of it for future versions.

There is a lot that remains to be done. First and foremost should be
fixing the bugs and getting \verb|latex2html-ng| up to par with
97.1. For example, in the very latest version, image generation is not
working since I have been toying with a few new ideas. I will need to
fix this before I can release the code. Beyond that, some more
profound features should be defined and implemented. It should be
easier to turn certain HTML features on and off, such as i18n, tables,
math (at several levels: no math, just sub/superscripts, or the full
thing), and maybe frames if absolutely necessary. Support for stylesheets would
also be nice; we would have to think about how stylesheets will be
integrated (e.g., maybe there should be different stylesheets for each
of the main \LaTeX\ styles \verb|article|, \verb|report|, and
\verb|book|) and how users will be able to customize stylesheets
either from within \LaTeX\ (e.g., by setting margins, colors, fonts,
etc.) or via appropriate Perl code.
\end{document}
