
Image Format Comparison

For Archiving Documents

by

2 April 2005

Introduction   Recommendations by situation

The table below shows the effects of scanning a printed word into several different formats. This is particularly aimed at people who need to archive printed or written documents, especially old ones, in digital format. The information may be useful to others as well. For a more general introduction to image formats, especially as they apply to the web, I suggest Daniel Beardsmore's article Making good use of Web image formats.

The original word was printed in 12-point Palatino. Except for the large GIF image, all scans were done at 300dpi. The scanner was an HP Scanjet 4400c -- a low-end scanner -- using the supplied HP Precisionscan Pro software.

The images shown here are not the original scans, for two reasons. First, I wanted to display them several times normal size so that you have half a chance of seeing the artefacts. Second, I manipulated the images into formats which would not be further modified in the process of displaying them, and this dictated using PNG format for the display images.

There is no standard definition of "high, medium, low" for JPEG image quality, nor any standard measure of JPEG quality. The high, medium, and low quality settings in the table are the ones the HP software offered. Some software allows you to pick the quality as a number, but that number is no more standardized than "high, medium, low".

Note that I do not discuss TIFF. TIFF is a file format, rather than an image format. It's possible to have various different image formats contained within a TIFF file. The acronym TIFF has become associated with lossless formats and especially with uncompressed formats, but that's only by convention and not by definition. Due to the variety of contained formats possible in TIFF, compatibility varies greatly. Now (April 2005) that PNG is widely recognized in current graphics software, there's little reason to consider TIFF.

I only consider open formats, not proprietary formats. For example, PSD (Photoshop) is an excellent format for preserving documents, but it's a proprietary format.

The file sizes are approximate, because the effectiveness of compression varies depending on the data. The "full page" size is for an 8.5" x 11.7" image, thus larger than either US letter or A4 paper.
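As a rough check on the full-page sizes, here is a minimal sketch of the arithmetic for an uncompressed scan. The function names are mine, chosen for illustration, and the 1-bit figure ignores any row padding a real file format might add.

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width, height) of a scan in pixels."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw size of the pixel data in bytes, before any compression."""
    width, height = page_pixels(width_in, height_in, dpi)
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8  # round up to a whole byte


# The 8.5" x 11.7" full page used for the table, scanned at 300dpi.
full_color = uncompressed_bytes(8.5, 11.7, 300, 24)  # 24 bits per pixel
black_white = uncompressed_bytes(8.5, 11.7, 300, 1)  # 1 bit per pixel

print(page_pixels(8.5, 11.7, 300))
print(full_color, black_white)
```

By this estimate the raw full-color page is about 26.9 million bytes, so the 15.2MB PNG in the table is a bit over half the uncompressed size, while the raw black and white page is about 1.1 million bytes before GIF compression.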

 

Photographs for archival storage or later editing: PNG, second choice high quality JPEG.

Photographs to serve on the web: medium quality JPEG, second choice low quality JPEG.

Documents to OCR: GIF, 300dpi for large clean type, 600dpi for better accuracy.

Clean black and white documents as archive copy: GIF, scan at least 600dpi.

Documents to archive appearance: PNG, scan at least 300dpi.

If your hard disk is too small: Buy a large external disk and a DVD writer! External hard disks in the 150GB-250GB range, in USB or Firewire, now (2005) cost well under US$1/GB, even less for internals. You need backups; many new computers now include DVD writers, and a separate DVD writer is under $100 (internal). A full-page PNG image at 300dpi will use about 15-20MB of disk, so a 200GB disk can hold 10,000 or more of these!

If your software can't read PNG files: Most likely you are running old software. I sympathize with the desire to keep running old systems which are working perfectly well, but some of the newer releases have brought some real improvements. If you are serious about archiving images of documents, this is a good reason to upgrade.

Half-tone images, such as magazine and newspaper photographs: these require specialized scanning techniques which I do not cover here.
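The disk-space estimate in the list above is simple division. A minimal sketch, assuming decimal units (1GB = 1000MB) and a hypothetical helper name:

```python
def images_per_disk(disk_gb, image_mb):
    """Number of images of image_mb megabytes that fit in disk_gb gigabytes."""
    return int(disk_gb * 1000 // image_mb)


# A 200GB disk with 20MB per full-page scan, then with the 15.2MB
# full-page PNG size measured in the table.
print(images_per_disk(200, 20))
print(images_per_disk(200, 15.2))
```

This ignores file system overhead, so treat the result as an upper bound.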

Each entry below gives a description of the format, then the format and scan resolution, the file size (sample / full page), recommendations, and a sample image (magnified 2x).

PNG is a lossless format which captures full color. It's compressed, but what you see is exactly what the scanner originally saw. The compression is only mildly effective, as you can see by comparing the file size with even the highest quality JPEG file. The image looks slightly fuzzy in the magnification, but this is only because when the scanner scanned a pixel which was half white and half black, it averaged to gray. So the gray edges in fact are real data, not an artefact.

PNG, 300dpi
File size (sample / full page): 29KB / 15.2MB

Because PNG is lossless and full color, it is the largest of the files. However, it has no compression artefacts and can be edited repeatedly without loss of quality. Best for any image (other than black and white) which may need to be edited in the future. Best for archival storage. Good for all full color images, including pictures of documents when the actual appearance is important.

PNG image
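To illustrate what "lossless" means for PNG, here is a minimal sketch using Python's zlib module, which implements the same DEFLATE compression that PNG uses internally. The pixel values are made up for illustration, and scanned_gray is a hypothetical helper modelling the averaging described above; a real PNG file also adds filtering and file structure that this sketch leaves out.

```python
import zlib


def scanned_gray(black_fraction):
    """Gray level (0 = black, 255 = white) for a pixel partly covered by ink."""
    return round(255 * (1 - black_fraction))


def roundtrip(data):
    """Compress and decompress; lossless means the bytes come back unchanged."""
    return zlib.decompress(zlib.compress(data, 9))


# One scanline crossing a letter edge: black, a half-covered pixel, then white.
row = bytes([0, 0, 0, scanned_gray(0.5), 255, 255, 255, 255])
scan = row * 1000  # repeat the row to stand in for a larger image

compressed = zlib.compress(scan, 9)
print(len(scan), len(compressed))
print(roundtrip(scan) == scan)
```

The gray edge pixel survives the round trip exactly, which is why it counts as real data rather than an artefact.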

GIF is a lossless format which captures very limited color. In my scans, I specified that the GIF scans only use black and white. It's compressed, but what you see is exactly what the scanner originally saw. The compression is very effective on most images. The edge looks grainy because there's no gray to fill in the part between full black and full white. For clean original documents, a B&W format such as this is ideal for OCR.

GIF, 300dpi
File size (sample / full page): 1KB / 88KB

Because the image is B&W (and thus one bit per pixel compared with 24 bits per pixel for full color), and because the compression is so effective, this is the smallest of the files. However, it does not look good at this size due to the graininess. Good for OCR on clean printed or typed documents. This is shown at many times the original size; when printed at 300dpi it will still look a lot better than a fax.

GIF image
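The graininess of the black and white scan comes from forcing every pixel to either black or white. A minimal sketch of that step, assuming a simple fixed cutoff (the scanner software may well use something smarter) and made-up gray values:

```python
def threshold(gray_row, cutoff=128):
    """Map gray levels (0-255) to bits: 1 = black, 0 = white."""
    return [1 if g < cutoff else 0 for g in gray_row]


def pack_bits(bits):
    """Pack bits into bytes, eight pixels per byte, first pixel in the high bit."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        chunk = chunk + [0] * (8 - len(chunk))  # pad the last byte with white
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


# A letter edge as the scanner sees it in gray: white fading to black.
gray_row = [255, 255, 200, 130, 90, 20, 0, 0]
bits = threshold(gray_row)
print(bits)
print(pack_bits(bits).hex())
```

Eight pixels fit in a single byte here, against 24 bytes in full color. The packing only illustrates the one-bit-per-pixel information content; GIF itself stores palette indices compressed with LZW rather than raw packed bits.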

This image is also GIF but is four times the resolution of the previous GIF. (The image here is only twice the size because I didn't double this one.) I scanned this one at 1200dpi instead of the 300dpi I used for the others. Despite having 16 times as many pixels, it's still a very small file.

GIF, 1200dpi
File size (sample / full page): 5KB / 649KB

Best for OCR on clean printed or typed documents, though 1200dpi is usually overkill. Use 600dpi for best OCR speed and accuracy on most documents.

GIF image quad resolution
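The pixel count grows with the square of the resolution, but the compressed file grows more slowly. A short sketch of the comparison, using the full-page sizes from the two GIF entries (the helper name is mine):

```python
def pixel_ratio(dpi_a, dpi_b):
    """How many times more pixels a scan at dpi_a has than one at dpi_b."""
    return (dpi_a / dpi_b) ** 2


pixels = pixel_ratio(1200, 300)  # 4x the resolution in each direction
size_ratio = 649 / 88            # full-page KB at 1200dpi vs 300dpi

print(pixels, round(size_ratio, 1))
```

So for this scan, sixteen times the pixels cost only about seven times the disk space.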

JPEG uses lossy compression: once it's been compressed, you cannot get an exact original back. The compression uses a model of human vision to discard information from the image which the human eye won't miss. Quality can be set to a range of values; low quality corresponds to more information discarded.

JPEG is very good for continuous tone images, the kind that photographs are mostly composed of. It falls down badly on images with many high-contrast sharp edges -- the kind of thing that documents are almost entirely composed of. Look at all the noise around and within the letters in the low quality JPEG image. It's a low-quality JPEG -- meaning a lot of information was discarded -- but it shows what happens in all JPEG images with high contrast edges, just more dramatically. You see less noise in the medium quality image, but it's still easy to see. Artefacts in the high quality image are very hard to see even if you magnify it two or three times. (Some web browsers, such as Opera, make magnifying the page easy.) You can still detect them in some documents though, and these artefacts can confuse OCR software.

Avoid repeatedly editing JPEG images at any quality, because you lose additional information each time the image is recompressed to save it.
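To see why sharp edges suffer, here is a much simplified sketch of the idea behind JPEG compression: transform a block of pixels with a discrete cosine transform (DCT), round the coefficients coarsely (more coarsely at high frequencies), and transform back. Real JPEG works on 8x8 blocks in two dimensions and adds colour conversion and entropy coding; this sketch uses a single row of 8 pixels, and the STEPS values are hypothetical, not taken from any real encoder.

```python
import math

N = 8  # one row of a JPEG-style 8-pixel block


def dct(block):
    """Orthonormal 1-D DCT-II of an N-sample block."""
    out = []
    for k in range(N):
        scale = math.sqrt((1 if k == 0 else 2) / N)
        out.append(scale * sum(
            x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            for n, x in enumerate(block)))
    return out


def idct(coeffs):
    """Inverse of dct()."""
    out = []
    for n in range(N):
        out.append(sum(
            math.sqrt((1 if k == 0 else 2) / N) * c
            * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            for k, c in enumerate(coeffs)))
    return out


# Hypothetical quantization steps: coarser rounding at higher frequencies.
STEPS = [16, 24, 40, 64, 96, 128, 160, 200]


def lossy_roundtrip(block):
    """Quantize the DCT coefficients and reconstruct the pixels."""
    quantized = [round(c / q) * q for c, q in zip(dct(block), STEPS)]
    return idct(quantized)


def max_error(block):
    """Largest per-pixel difference after the lossy round trip."""
    return max(abs(a - b) for a, b in zip(block, lossy_roundtrip(block)))


edge = [0, 0, 0, 0, 255, 255, 255, 255]          # sharp black-to-white edge
ramp = [100, 104, 108, 112, 116, 120, 124, 128]  # smooth, photo-like gradient

print(round(max_error(edge), 1), round(max_error(ramp), 1))
```

With these made-up steps the smooth ramp comes back almost unchanged, while the sharp edge comes back with errors many times larger, smeared across the whole block. That smearing is the noise visible around the letters in the JPEG samples.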

JPEG, 300dpi, low quality
File size (sample / full page): 2KB / 313KB

Good for photographs on the web because it does a reasonably good job (especially for small images) and minimizes download times. Never use for archival copies or for copies which may need editing again. Never use for documents.

JPG image low quality

JPEG, 300dpi, medium quality
File size (sample / full page): 3KB / 371KB

Best for photographs on the web. Download times are only a little longer than for low quality, and the appearance is often considerably better, especially for medium to large images. Never use for archival copies or for copies which may need editing again. Never use for documents.

JPG image medium quality

JPEG, 300dpi, high quality
File size (sample / full page): 14KB / 4.40MB

Very good for archival storage of photographs which will not need editing again, or only minor editing. Poor for images on the web due to large size -- convert to medium quality for the web. Marginally OK for archiving documents in a pinch, if black and white is inadequate and you don't have space for PNG images.

JPG image high quality