Andrew Stacey


About
Andrew Stacey
Information about my research, teaching, and other interests.

By: Andrew Stacey

Fri, 29th Jul 2011 (HowDidIDoThat :: LaTeX)

The internet Class

This is a class designed for those who want to use LaTeX to publish material on the internet. As it becomes more common to publish material via some content-management system, it becomes rarer to generate (X)HTML documents directly, and content for the web is usually written in some limited markup language. This class is designed to make it possible to author material for such systems using the facilities of LaTeX.


Introduction

It is important at the outset to know what this class is designed for and the easiest way to do that is to explain what it is not for. It is not intended as a way to take an already authored LaTeX document and convert it into a format suitable for putting on the internet. It is also not intended as a "plugin" for a content-management system so that authors can use LaTeX as the input format for their blog, forum, wiki, or whatever.

The first of these is both impossible and undesirable. It is impossible for the following simple reason. Because TeX controls the whole process of creating its output, it can do some amazing things. A web document, on the other hand, is extremely malleable and the reader can transform it considerably; therefore, not everything that TeX can do can be (easily) done on the web. It is possible, via a considerable amount of trickery, to get quite close, but only at the expense of this malleability. And this flexibility is a good thing, as it allows the reader to make the document as easy for them to access as possible. This is why it is undesirable to support absolutely anything that LaTeX can do.

The second, the plugin, is also undesirable. There is a good reason that the current input formats were chosen: they are simple. They are simple to understand by a human: when writing on a wiki, it would be extremely inconvenient to have to learn the current page's set of macros so there is an advantage to having them consistent across the whole system. They are also simple to understand by a computer: parsing a file in, say, markdown is extremely quick and can be done in real time by scripting languages such as Perl and PHP. Parsing TeX is much harder due to its complexity, and so either the scripting language has to limit the syntax or it has to pass it to an external program, both of which can put severe limitations on the system.

So this package is designed for someone who wants to write a web page, not necessarily directly, using their LaTeX skills. They want the full flexibility of LaTeX together with its familiarity (presumably they write other documents in LaTeX already) but know at the outset that the document will end up on the web.

Usage

Usage of this package is extremely simple. It is designed as a class and so should be loaded with a simple:

\documentclass{internet}

There are various options available for the class (these are processed internally by the pgfopts package, which needs to be installed). The main option is to specify the type of output. The types of output can be classified into two groups. The first group consists of the output system; that is, the software program that will be used to present the material. The second group consists of the output format. In common use, one would specify a system, which would then load the required format (and define a few extra bits and pieces). At the moment, there are the following possibilities for the system.

  • latex This is the default and processes the document as if it were a LaTeX document.

  • instiki This is formatted suitably for the instiki wiki system.

  • wordpress This is formatted suitably for the wordpress blog system.

  • blosxom This is formatted suitably for the blosxom blog system. (Warning: My whole website is built using blosxom and I use a lot of plugins there; I can't remember all of the modifications that I've made to them, so the blosxom option should be used with caution!)

The formats are as follows.

  • markdown The Markdown format.

  • markdownextra The Markdown Extra format (built on top of Markdown).

  • maruku This is the Ruby implementation of Markdown, which extends Markdown Extra.

There is also the matter of putting mathematics on the web. Some systems load a particular method by default, but this can be overridden using the maths mode=mode class option. Currently supported are the following.

  • itex This formats the mathematics suitably for the itextomml system for producing MathML.

  • basic This does a little formatting within the confines of ordinary HTML.
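
As a rough sketch of how these options might be combined: the maths mode=itex key is the one named above, whereas passing the system name (here instiki) as a bare class option is an assumption about the option syntax, not something this page confirms.

```latex
% Sketch only. `maths mode=itex` is the key described in the text; giving
% the system name `instiki` as a bare class option is an ASSUMPTION.
\documentclass[instiki, maths mode=itex]{internet}

\begin{document}
Some text with a little mathematics, $a^2 + b^2 = c^2$.
\end{document}
```

Omitting the options altogether falls back to the default latex system, which processes the file as an ordinary LaTeX document.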

One File, Many Outputs

Although the intention is that a document written using this system be written knowing the eventual output, it is certainly not unreasonable to use a single file for several outputs, or to use a single fragment in documents on different systems. Whilst the attempt is to make it so that the same input works for all outputs, there will be times when one output type requires a slightly different input to another. For that situation, the \imode command is provided. It works in much the same fashion as the beamer command \mode (indeed, the code is taken from that package). The syntax is either \imode<mode>{stuff} or just \imode<mode>; the latter must be by itself on a line. In both cases, mode is one of the possible outputs.

Let us take \imode<mode>{stuff} first. If mode matches what was specified as a class option, then the contents of the argument are executed. If not, they are thrown away.

The second use, \imode<mode>, is more complicated. If the mode matches, then TeX carries on as normal. If not, it starts gobbling stuff until it reaches another \imode<mode>, \begin{document}, or \end{document}. In the first case, it reevaluates the mode. In the second or third cases, it starts acting normally again.

Using the \imode command it is possible to specify material that is only processed in one mode.

(There is not yet support for specifying multi-modes.)
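
To make the two forms concrete, here is a small sketch based on the description above. It uses only the latex and instiki modes; the sentences themselves are placeholders.

```latex
\documentclass{internet}
\begin{document}

% Argument form: only the braced text is conditional on the mode.
\imode<latex>{This sentence is processed only in latex mode.}
\imode<instiki>{This sentence is processed only in instiki mode.}

% Bare form: the command sits alone on its line. If the mode does not
% match, everything up to the next \imode<...> (or \end{document}) is gobbled.
\imode<instiki>
This paragraph is skipped unless the mode is instiki.

\imode<latex>
This paragraph is skipped unless the mode is latex.

\end{document}
```

With the default latex output, only the latex sentence and the final paragraph should survive; the bare \imode<instiki> causes TeX to discard text until it meets \imode<latex>.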

Post-Processing

Running pdflatex on a document written with the internet class results in a PDF. This needs to be converted to text. There are many excellent tools for doing that; the program pdftotext from the xpdf system works well. Use it as follows:

pdftotext -enc ASCII7 -layout -nopgbrk texfile.pdf

Since line-breaks and indentation are often significant in text formats, the text modes define an extremely wide page in the hope that no paragraph will be quite that long. Whilst pdftotext is fairly good at preserving the layout, it is not reliable at preserving the exact number of spaces. For this reason, in a mode where indentation is significant, every line starts with a string of the form XXX: where the number of Xs is the number of spaces to indent. A simple Perl script can convert those to spaces. A shell script, latex2txt.sh, is provided that automates the process of going from LaTeX document to text output with correct indentation.

Limitations

Too many to mention!

The major limitation is to do with external packages. Most external packages will not work directly with this class (at least, in a text mode). This harks back to the problem of taking an already-written document and converting it. LaTeX packages were written with the understanding that the final outcome would be a static document, not some text that will be further processed.

That is not to say that no packages will ever be supported. Packages could be supported by writing an alternative which translates the commands to a sensible output. However, this will need to be done on a case-by-case basis as the need arises.

Thus, for now, it is best to put all the \usepackage statements inside a \imode<latex>{...} command.
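
A preamble following this advice might look like the sketch below. The packages named are arbitrary examples, not ones the class is known to need or support.

```latex
\documentclass{internet}

% Packages are loaded only when the output is an ordinary LaTeX document;
% in any other mode the whole argument is thrown away.
\imode<latex>{%
  \usepackage{amsmath}
  \usepackage{graphicx}
}

\begin{document}
Text that should work in every output mode.
\end{document}
```

Any command that comes from such a package then also needs to be confined to \imode<latex>{...} in the body, since it will be undefined in the text modes.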

Obtaining the Code

This package is still in early days so is not yet on CTAN. It can be downloaded here as a zip file or a tar.gz file. This also contains the source of this document, which was written using this class.

Last modified on:
Fri, 29th Jul 2011