GNU UnRTF User's Manual
For program version 0.18.1
(A work in progress.)
Copyright (C) 2001
by Zachary Thayer Smith.
All rights reserved.
Document begun 18 Sept 01.
Last updated 02 Oct 01.
Preface
Once upon a time, GNU UnRTF was a program that I wrote called
"rtf2htm". This seemed too generic a name, since many free programs
of varying quality exist with that name. So I finally settled on
a new name, UnRTF. This name reflects a desire to convert away
from the RTF format, to various other formats.
When it came time to include the program in the GNU software suite,
the program name was changed to GNU UnRTF.
This document is also provided AS-IS and without any warranty of any kind.
The user shall utilize the program and/or this document
at his or her own risk.
I am the primary engineer behind UnRTF; however, I have
received comments and bug reports from various people.
These contributors are identified in the source code,
where they wished to be mentioned.
Program License
UnRTF, a command-line program to convert RTF documents to other formats.
Copyright (C) 2000,2001 Zachary Thayer Smith
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
The author is reachable by electronic mail at tuorfa@yahoo.com.
Introduction
UnRTF is a program to convert RTF (Rich Text) documents to
other formats. At present, conversion to HTML is the most
complete. I am presently adding LaTeX, plain text,
text with VT100 codes, and PostScript conversion.
I will later add my own format, WPML (word processor markup language),
to that list; that format is described at
http://www.geocities.com/tuorfa/wpml.html.
Converting to HTML
The program supports many features of the current RTF standard
when converting to HTML.
Character Attributes
Feature Name | Supported? |
Text font change | yes |
Text font sizes | yes |
Text bold, italic | yes |
Text single-underline | yes |
Other text underlining modes (double, dashed etc) | converted to basic underline |
Text shadow, outline, emboss, engrave | converted to bold or italic |
Text (single-line) strikethrough | yes |
Text double-strikethrough | converted to single-strikethrough |
Text all-caps | yes |
Text small-caps | yes |
Text superscript, subscript | yes |
Text expand/condense | yes (not supported by all browsers) |
Text foreground color change | yes |
Text background color change | yes |
Character Sets
RTF supports at least four character sets, probably more.
These four are: ANSI, Macintosh(TM), PC codepage 437, and PC codepage 850.
In order to be able to read each of these, a converter can use one
of two strategies: either have conversion tables from each of
these four to each potential output format, or convert from
each of these four to an intermediate, and then have one conversion
table from the intermediate to each output format.
The first approach requires 4n tables (one for each of the four
input sets, for each output format), whereas
the second requires 4+n tables, where n is the number of output
formats.
Obviously the second approach is
better, but implementing it requires research to find
out what the maximal set of characters is. I haven't gotten around to
that, so for the time being,
UnRTF uses the first approach.
In addition, existing open source software may already
be available to perform such conversions based on a larger
library of character sets. If so, it would be wiser to
utilize an existing system such as that.
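The second strategy can be sketched as follows. This is a minimal
illustration, not UnRTF's code: it assumes Unicode code points as the
intermediate set (the manual does not name one), it covers only a single
accented letter per input set, and the names CharSet, to_unicode,
unicode_to_html and convert_char are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical identifiers for the four input character sets. */
typedef enum { CS_ANSI, CS_MAC, CS_CP437, CS_CP850, CS_COUNT } CharSet;

/* Stage 1: one mapping per input set, from a byte to the intermediate.
 * Unicode code points are assumed as the intermediate. Only e-acute
 * is mapped here; a real table would cover all 128 high bytes. */
static unsigned to_unicode(CharSet cs, unsigned char byte)
{
    if (byte < 0x80)
        return byte;                     /* all four sets agree on ASCII */
    switch (cs) {
    case CS_ANSI:  if (byte == 0xE9) return 0x00E9; break;
    case CS_MAC:   if (byte == 0x8E) return 0x00E9; break;
    case CS_CP437: if (byte == 0x82) return 0x00E9; break;
    case CS_CP850: if (byte == 0x82) return 0x00E9; break;
    default: break;
    }
    return 0xFFFD;                       /* not covered by this sample */
}

/* Stage 2: one mapping per output format, here HTML.
 * buf must hold at least 12 bytes. */
static const char *unicode_to_html(unsigned cp, char *buf, size_t buflen)
{
    switch (cp) {
    case '<':    return "&lt;";
    case '>':    return "&gt;";
    case '&':    return "&amp;";
    case 0x00E9: return "&eacute;";
    default:     break;
    }
    if (cp < 0x80)
        snprintf(buf, buflen, "%c", (int)cp);
    else
        snprintf(buf, buflen, "&#%u;", cp);   /* numeric fallback */
    return buf;
}

/* The two stages composed: 4 input tables plus n output tables. */
static const char *convert_char(CharSet cs, unsigned char byte,
                                char *buf, size_t buflen)
{
    return unicode_to_html(to_unicode(cs, byte), buf, buflen);
}
```

With this layout, adding a fifth input set means writing one new
stage-1 table, and adding an output format means writing one new
stage-2 table; neither change touches the other stage.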
Text Blocks
Feature Name | Supported? |
Tables | yes |
Table cell background patterns e.g. diagonal lines | no |
Paragraph left-align | yes |
Paragraph right-align | yes |
Paragraph centered | yes |
Paragraph justify | yes |
Paragraph center within table | buggy? |
Converting to LaTeX
LaTeX is a tricky format to convert to, for several reasons.
It's a very specialized system of macros. One could argue that it
would be easier to convert to raw TeX than bother with the
idiosyncrasies of LaTeX. It has its own character set and fonts. It has
some commands which are unstable, such as \underline.
Some commonplace items are not for use outside of equations,
e.g. superscripting. I've made an initial effort at getting
the converter to work, with improvements to follow later.
Character Attributes
Feature Name | Supported? |
Text font change | not yet |
Text font sizes | yes |
Text bold, italic | yes |
Text single-underline | no |
Other text underlining modes (double, dashed etc) | no |
Text shadow, outline, emboss, engrave | no |
Text (single-line) strikethrough | no |
Text double-strikethrough | no |
Text all-caps | yes |
Text small-caps | yes |
Text superscript, subscript | yes |
Text expand/condense | no |
Text foreground color change | no |
Text background color change | no |
Character Sets
Under construction.
Text Blocks
Feature Name | Supported? |
Tables | yes |
Table cell background patterns e.g. diagonal lines | no |
Paragraph left-align | yes? |
Paragraph right-align | no |
Paragraph centered | yes? |
Paragraph justify | yes |
Paragraph center within table | no |
Converting to PostScript
Converting to PostScript is tricky because it is not actually
a document format. PostScript is in fact a stack-based programming
language that is executed in the printer.
It lacks such concepts as paragraphs and tables, or anything document-related
really, but it does have drawing primitives, mechanisms for accessing
built-in fonts, and can print pages.
So at first it would seem that conversion to this format is a very large
obstacle. Actually, PostScript
is a robust and enjoyable programming language and I am enjoying
the task of writing the PostScript code. Presently my
text renderer is limited, since it is quite new. I will be improving it soon.
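To make the "programming language, not document format" point concrete,
here is a minimal sketch of what a converter has to emit just to place
one line of text on a page. It is not UnRTF's renderer; the function
name ps_show_line and its parameters are hypothetical. The PostScript
operators it writes (findfont, scalefont, setfont, moveto, show,
showpage) are standard, and each takes its operands from the stack, so
they appear before the operator.

```c
#include <stdio.h>

/* Hypothetical helper: format a PostScript fragment that selects a font,
 * moves to (x, y) in points, draws `text`, and ejects the page.
 * Returns what snprintf returns. Text longer than the internal buffer
 * is silently truncated in this sketch. */
static int ps_show_line(char *buf, size_t buflen, const char *font,
                        int size, int x, int y, const char *text)
{
    char escaped[512];
    size_t n = 0;

    /* Inside a ( ... ) string literal, parentheses and backslashes
     * must be preceded by a backslash. */
    for (; *text != '\0' && n + 2 < sizeof escaped; text++) {
        if (*text == '(' || *text == ')' || *text == '\\')
            escaped[n++] = '\\';
        escaped[n++] = *text;
    }
    escaped[n] = '\0';

    return snprintf(buf, buflen,
                    "/%s findfont %d scalefont setfont\n"
                    "%d %d moveto\n"
                    "(%s) show\n"
                    "showpage\n",
                    font, size, x, y, escaped);
}
```

Calling it with the font Times-Roman, size 12, position 72 720 and the
text Hi (there) yields a four-line program whose third line is
(Hi \(there\)) show. Line breaking, underlining and the like all have to
be computed by the converter or by PostScript procedures it supplies.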
Character Attributes
Feature Name | Supported? |
Text font change | not yet |
Text font sizes | yes |
Text bold, italic | yes |
Text single-underline | yes |
Other text underlining modes (double, dashed etc) | converted to basic underline |
Text shadow, outline, emboss, engrave | shadow only |
Text (single-line) strikethrough | yes |
Text double-strikethrough | converted to single-strikethrough |
Text all-caps | yes |
Text small-caps | not yet |
Text superscript, subscript | not yet |
Text expand/condense | yes |
Text foreground color change | not yet |
Text background color change | not yet |
Character Sets
Under construction.
Text Blocks
Paragraph alignment and tables are not yet supported for PostScript
output.
Extra Features
None yet.
Converting to Plain Text
Under construction.
Converting to Text with VT100 control codes
Under construction.
Converting to WPML
Under construction.
Features Not Yet Supported
As development continues, I will try to add support
for other features. Some that I know are not
covered but that I would like to address include:
- numbered lists and bulleted lists
- shapes (objects composed of lines, circles etc)
- index entries and index generation
- tables of contents entries and generation
- automatic conversion of embedded images to PNG
Using UnRTF
Please refer to the manual page (unrtf.1).
Compilation
Please see the README file.
Theory of Operation
This program essentially reads the entire
RTF file into memory and works on it.
Because of this, it may require that you run
the program on a computer that has virtual
memory enabled. With smaller input files
it should be possible to use the program under DOS,
so long as it is compiled with the
DOS version of GCC, called DJGPP.
The program operates by dealing with each
RTF word in order, and interpreting those
which are commands. Some RTF command words
have parameters in a subtree. The command
\info is an example. The program has separate
routines to handle such cases. In fact,
most commands have separate functions which
handle their execution.
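The word-by-word processing described above can be sketched like this.
It is an illustration under assumptions, not the program's actual data
structures: the Word node layout, the State buffer, the handler names
and the tiny command table are all hypothetical, and the lookup is a
plain linear search for brevity.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical node: one RTF word, the word that follows it, and an
 * optional subtree holding the contents of a { ... } group. */
typedef struct Word {
    const char  *text;    /* NULL when the node only holds a subtree */
    struct Word *next;
    struct Word *child;
} Word;

typedef struct { char out[256]; } State;

static void emit(State *st, const char *s)
{
    strncat(st->out, s, sizeof st->out - strlen(st->out) - 1);
}

/* A handler returns nonzero when it has consumed the rest of its group. */
typedef int (*Handler)(const Word *w, State *st);

static int cmd_bold_on(const Word *w, State *st)  { (void)w; emit(st, "<b>");    return 0; }
static int cmd_bold_off(const Word *w, State *st) { (void)w; emit(st, "</b>");   return 0; }
static int cmd_par(const Word *w, State *st)      { (void)w; emit(st, "<br>\n"); return 0; }

/* \info keeps its parameters in subtrees, so it gets a dedicated routine
 * that scans the following subgroups itself (only \title is handled). */
static int cmd_info(const Word *w, State *st)
{
    for (w = w->next; w != NULL; w = w->next) {
        const Word *g = w->child;
        if (g != NULL && g->text != NULL && strcmp(g->text, "\\title") == 0
            && g->next != NULL && g->next->text != NULL) {
            emit(st, "<title>");
            emit(st, g->next->text);
            emit(st, "</title>");
        }
    }
    return 1;
}

typedef struct { const char *name; Handler handler; } Command;

static const Command commands[] = {
    { "\\b", cmd_bold_on }, { "\\b0", cmd_bold_off },
    { "\\info", cmd_info }, { "\\par", cmd_par },
};

static const Command *find_command(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof commands / sizeof commands[0]; i++)
        if (strcmp(commands[i].name, name) == 0)
            return &commands[i];
    return NULL;                           /* unknown commands are ignored */
}

/* Visit each word in order: descend into groups, dispatch commands,
 * and copy plain text to the output. */
static void walk(const Word *w, State *st)
{
    for (; w != NULL; w = w->next) {
        if (w->child != NULL) {
            walk(w->child, st);
        } else if (w->text == NULL) {
            continue;
        } else if (w->text[0] == '\\') {
            const Command *c = find_command(w->text);
            if (c != NULL && c->handler(w, st))
                return;
        } else {
            emit(st, w->text);
        }
    }
}
```

Given a tree for {\info{\title Demo}}\b Hello\b0 \par, the walk above
leaves <title>Demo</title><b>Hello</b><br> followed by a newline in the
State buffer.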
When the program was called rtf2htm (up through
version 0.17 or so), the output mechanism was
based on the production of HTML exclusively.
This has now changed, and the abstraction of an
OutputPersonality is used to allow other output
formats. Each format has its own C file,
in which all the basic strings for producing
text are stored, as well as character conversion
tables. Note, RTF itself allows several character
sets to be used, so for each output personality
there are that many conversion tables.
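The idea of an output personality can be illustrated with a
much-reduced sketch. The struct below borrows the name
OutputPersonality from the text, but its fields, the two instances and
the render_bold helper are assumptions made for illustration; the real
structure holds many more strings as well as the character conversion
tables.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, much-reduced personality: the strings one output
 * format needs for a few basic operations. */
typedef struct {
    const char *bold_begin;
    const char *bold_end;
    const char *italic_begin;
    const char *italic_end;
    const char *paragraph_end;
} OutputPersonality;

/* One instance per output format; in the layout the text describes,
 * each would live in its own C file. */
static const OutputPersonality html_personality = {
    "<b>", "</b>", "<i>", "</i>", "<br>\n"
};

static const OutputPersonality latex_personality = {
    "\\textbf{", "}", "\\textit{", "}", "\n\n"
};

/* Format-independent code never names a format; it only reads the
 * strings supplied by whichever personality is active. */
static int render_bold(const OutputPersonality *op, const char *text,
                       char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%s%s%s%s",
                    op->bold_begin, text, op->bold_end, op->paragraph_end);
}
```

Passing html_personality renders the word Hi as <b>Hi</b><br>, while
latex_personality renders it as \textbf{Hi} followed by a blank line.
Supporting another format then means supplying another table of
strings rather than changing the conversion logic.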
One or two things that UnRTF does are fairly
tricky, such as the conversion of tabular data.
RTF encodes tables in an odd way compared
to HTML or LaTeX, so the code is accordingly
complicated. Suffice it to say that it works,
so don't touch it. Do note, PostScript has no
concept of a table, since it is not
a document format but a programming language.
I will eventually get tables working under PS
anyway, by porting my table rendering code
over from my HTML viewer, Beest.
I have implemented at least three optimizations to
reduce the amount of memory required
by the program and the time used for the conversion.
- Text words and RTF command-words are stored in a
hash table. This has the effect of saving memory
since commonly occurring words such as "the" and "\par"
are not repeated in memory. When the program
finishes doing the conversion, it reports the
number of words hashed.
- RTF command-words and pointers to the functions
that interpret them are stored in a static hash
so that execution can be speedy. This replaces the
long if-else sequence once used and greatly speeds
up the program.
- Input data are buffered, to eliminate the large
number of calls to the fgetc function. In a modern
OS such as Linux this has only a small impact, but
under DOS it can really help.
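The first of these optimizations, storing each distinct word only once,
can be sketched as a small chained hash table. This is an illustration
under assumptions rather than the program's code: the bucket count, the
hash function and the names intern, bucket_of and words_hashed are all
hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define BUCKETS 256

typedef struct Entry {
    struct Entry *next;
    char         *text;
} Entry;

static Entry *buckets[BUCKETS];
static unsigned long words_hashed;     /* distinct words stored so far */

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned bucket_of(const char *s)
{
    unsigned h = 5381;
    while (*s != '\0')
        h = h * 33u + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Return the single shared copy of `s`, storing it on first sight.
 * Returns NULL if memory runs out. */
static const char *intern(const char *s)
{
    unsigned b = bucket_of(s);
    size_t len;
    Entry *e;

    for (e = buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->text, s) == 0)
            return e->text;            /* already stored: reuse it */

    e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    len = strlen(s) + 1;
    e->text = malloc(len);
    if (e->text == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->text, s, len);
    e->next = buckets[b];
    buckets[b] = e;
    words_hashed++;
    return e->text;
}
```

Interning "the" twice returns the same pointer both times, so a
document containing the word a thousand times pays for its characters
once. A counter like words_hashed is the kind of figure the program
could report when the conversion finishes.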
Notes
- LaTeX is a system of macros for TeX, originated by Leslie Lamport.
- WPML is a tentative document format by Zachary Thayer Smith.
- PostScript is a stack-based programming language for printers.