DATA COMPRESSION FOR VERY SPARSE INTEGER ARRAYS
1. Introduction
The method of data compression described in this article was developed for
archiving and processing EXOSAT images, but could be adapted to other kinds of
very sparse data also. It is perhaps not the optimum method in terms of reducing
data volume, but it is relatively simple to implement and does not lead to any
loss of information from the original uncompressed data.
All EXOSAT images already exist on Final Observation Tapes (FOTs) in the form of a list of X-ray events. Usually one event is described by 8 bytes
of data (position, quality, sum signal, and time tag). For applications which
need only the positions of events, only 2 of the above 8 bytes per event need be
retained; however, this more compact form for storing positional data has never
been implemented in the EXOSAT observatory software, and up to now the accumulation
of pixels has been performed by starting from the 8-byte events on an FOT.
Recently an even more compact way of storing positional data has been
implemented instead; it requires only about 4 bits per event, and is therefore
about 4 times more compact than simply listing the positions of individual
photons (up to 8 times more compact for very long exposures).
2. Compression Algorithm
In a typical 2048 x 2048-pixel EXOSAT image the average density of photons is as
little as 1 per 5-10 pixels; pixels in background areas of the image will
therefore nearly all have values of 0 or 1. This property is shared by printed
text (the corresponding values being white and black) and suggests that one could
copy a data compression method used for facsimile transmission (see e.g. ref 1).
However, in parts of the image there are, of course, pixels with values exceeding
1, and allowance for any 16-bit integer value must be made. A system of encoding
GROUPS of consecutive pixels has been adopted, using four Huffman codes: 1, 01, 001,
and 000. These are bit patterns which indicate how to decode the accompanying
group of pixels. The four corresponding encoding methods are defined by the
following bit strings:
(i) 1 represents 4 consecutive zero-valued pixels; the zeroes are not actually
present in the file but are implied by this single bit set to 1,
(ii) 01nn represents 4 pixels containing just one photon in position nn within the group,
the other 3 pixels being implicitly zero,
(iii) 001nnmm represents 4 pixels containing two photons in positions nn and mm within the
group,
(iv) 000aabbccdd represents 4 pixels containing at least three photons altogether.
The numbers nn, aa, etc. are two-bit binary integers. In methods (ii) and (iii) it is more economical
to encode the position of a photon as an offset within the group of four pixels;
method (iii) includes the case of two photons in the same pixel (nn=mm). For
method (iv), however, the pixels are directly encoded as two-bit integers if they
have values 0, 1, or 2. Individual pixels in the range 3 to 17 are encoded as
(binary) 11 followed by a four-bit string in the range 0 to 14 (the pixel value
minus 3); pixels greater than 17 are encoded as (binary) 111111 followed by the
full 16-bit pixel value. Thus the total length of the bit string for method (iv)
is variable, although it is most commonly 11 bits, including the Huffman code.
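The four encoding methods can be sketched in Python. This is an illustrative
reconstruction from the description above, not the EXOSAT software itself; it
works on strings of '0'/'1' characters rather than packed bits, purely for
readability.

```python
def encode_pixel(v):
    """Encode one 16-bit pixel value for method (iv)."""
    if v <= 2:
        return format(v, '02b')                   # 00, 01 or 10 directly
    if v <= 17:
        return '11' + format(v - 3, '04b')        # escape 11 + (value - 3) in 0..14
    return '11' + '1111' + format(v, '016b')      # escape 111111 + full 16-bit value

def encode_group(group):
    """Encode a group of 4 pixel values as a bit string (methods (i)-(iv))."""
    total = sum(group)
    if total == 0:
        return '1'                                # method (i): 4 implied zeroes
    if total == 1:
        return '01' + format(group.index(1), '02b')   # method (ii): 01nn
    if total == 2:
        # method (iii): 001nnmm; a pixel of value 2 gives nn = mm
        pos = [i for i, v in enumerate(group) for _ in range(v)]
        return '001' + format(pos[0], '02b') + format(pos[1], '02b')
    return '000' + ''.join(encode_pixel(v) for v in group)  # method (iv)
```

Note that the two-photon case covers nn = mm naturally: a group such as
[2, 0, 0, 0] encodes as 0010000, with both position fields pointing at pixel 0.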
The full field-of-view is subdivided into 16 x 16 "small maps" each of 128 x 128
pixels, as on the FOT itself; a table near the beginning of a compressed
image gives the offset of each small map and the total number of photons
it contains. The decompression of the data therefore has to be performed
in units of a complete small map, and unwanted areas of the image can be
disregarded, in order to reduce the CPU time needed.
The size of a pixel group was chosen above to be four, because this leads to a
minimum size for a well-populated small map. Any other small power of two could
instead have been chosen, with a similar encoding algorithm, and indeed for a
relatively empty small map (between 1 and 1024 photons) a group size of 16 was
adopted; it is advantageous because the minimum overhead of 512 bytes (4096
groups each encoded according to method (i) above) is
reduced to only 128 bytes, at the expense of increasing the number of additional
bits per photon from 3 to 5. A small map containing no photons at all is entirely
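The overhead figures quoted above follow directly from the map and group sizes;
a quick check in Python (assuming, as the text implies, that an all-zero group
costs one bit whatever the group size):

```python
MAP_SIDE = 128                              # small maps are 128 x 128 pixels

def empty_map_bytes(group_size):
    """Bytes needed to encode an all-zero small map: one 1-bit code per group."""
    n_groups = MAP_SIDE * MAP_SIDE // group_size
    return n_groups // 8                    # one bit per group, packed 8 to a byte

# group size 4:  4096 groups -> 512 bytes; group size 16: 1024 groups -> 128 bytes
```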
omitted from the compressed image; this happens quite frequently, especially when
the on-board software uses a spatial filter to select events. Clearly there exist
possibilities for varying the algorithm, and some tests were made with other
group sizes and additional Huffman codes, but none resulted in any worthwhile
reduction to the total sizes of images formed from actual EXOSAT data, and so for
the sake of simplicity the algorithm was restricted to the cases already
described.
The decoding of compressed images relies on the efficient manipulation of bit
strings; care must be taken not to introduce unnecessary inefficiencies because
of the possible impact on CPU time. A precise specification of the image layout
and a FORTRAN listing of the decompression routines currently in use on the
HP-1000 computers at ESTEC can be provided on request. A program will be provided
to convert these compressed images to FITS format.
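As an illustration of the bit-string decoding just described, the following
Python sketch decodes a sequence of four-pixel groups. It is not the HP-1000
FORTRAN routines; the flat '0'/'1' string representation is used for clarity
only, and a real implementation would operate on packed words.

```python
def decode_groups(bits, n_groups):
    """Decode n_groups four-pixel groups from a bit string; returns a flat pixel list."""
    pixels, pos = [], 0

    def take(n):
        nonlocal pos
        field = bits[pos:pos + n]
        pos += n
        return field

    for _ in range(n_groups):
        group = [0, 0, 0, 0]
        if take(1) == '1':                   # method (i): four implied zeroes
            pass
        elif take(1) == '1':                 # prefix 01: one photon at position nn
            group[int(take(2), 2)] = 1
        elif take(1) == '1':                 # prefix 001: photons at nn and mm
            group[int(take(2), 2)] += 1
            group[int(take(2), 2)] += 1
        else:                                # prefix 000: pixel-by-pixel encoding
            for i in range(4):
                v = int(take(2), 2)
                if v == 3:                   # escape 11: four-bit field follows
                    v = int(take(4), 2)
                    if v == 15:              # escape 111111: full 16-bit value
                        v = int(take(16), 2)
                    else:
                        v += 3               # four-bit field holds value - 3
                group[i] = v
        pixels.extend(group)
    return pixels
```

The nested one-bit tests reproduce the prefix-free property of the codes 1, 01,
001, and 000: the decoder never needs to look ahead to know which method applies.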
Reference.
1. Huang, T.S. and Hussain, A.B.S., "Facsimile Coding by Skipping White",
IEEE Trans. Commun., vol. COM-23, no. 12, 1975, pp. 1452-1466.
J.R.Sternberg