Reading and Writing Compressed Data Sets

Introduction
Chunking and Caching
Setting Cache Size
Reading Compressed Files
Writing Compressed NetCDF Files
Setting Chunk Size


Introduction
GrADS version 2.0.a8 introduces the capability to read and write compressed netCDF files. The use of compression can significantly reduce data volume and speed up I/O. Optimizing performance depends on a number of factors: the configuration of the software, the available memory on the local hardware, the data volume, and even the nature of the I/O request (e.g. an X-Y plane or a Z-T cross section). It is impossible to customize GrADS to account for all the external factors, but it is possible to make some good choices for the internal software controls based on what is known about the data, the kind of analysis to be performed, and the system on which GrADS is running.

Chunking and Caching
A compressed netCDF file is actually an HDF5 file, so we begin with some HDF5 vocabulary. In HDF5, a chunk is an atomic unit of data -- chunks are handled individually for compression, reading and writing to disk, and storing in the cache. A multi-dimensional variable is a collection of chunks; each chunk has the same number of dimensions as the variable, and the size of a chunk in any dimension is less than or equal to the size of the variable in that dimension.

When a compressed data set is written out to a file, it is first divided into chunks, then each chunk is compressed, and finally the compressed chunk is written out to disk. When a client (such as GrADS) reads a compressed data set, a compressed chunk is read from disk, decompressed, and then the requested data values are returned to the client. The HDF5 library uses a cache to store decompressed chunks in memory in case the client requests data from the same chunk again; data values are returned to the client much faster from the cache because the disk I/O and decompression steps are skipped.

To understand and optimize the performance of GrADS when reading a compressed netCDF file, it is helpful to know that GrADS does I/O by rows. Suppose a chunk contains 10 rows of a 2D grid, and the user wants to display 100 rows. The first chunk is read from disk, decompressed, cached, and the first row is delivered to GrADS. For the next nine rows, the library has the chunk cached, so the I/O is faster. For the 11th row, the next chunk is read from disk and the process repeats. If the cache is too small to store an entire chunk, then the library releases the chunk after reading one row, and must read and decompress the same chunk from disk again in order to read the next row. In this case, when the cache size is very small, the performance of GrADS when reading compressed data will be very slow. It is vital that the cache be big enough to hold many chunks!

The HDF5 library allocates the chunk cache on a per-variable basis, so if you open a file and display many variables, you can quickly eat up a lot of memory. Be careful that the cache does not get too large; otherwise you may use up all the available memory on your system, which will cause GrADS to crash.

Setting Cache Size
It has been established that the cache size must be big enough to hold many chunks, but not so big that reading several variables in a file will use up all available memory. So how big should it be? The answer depends on the chunk size. For good performance in GrADS, the cache should be at least big enough to hold enough chunks to cover the global lat/lon domain. GrADS sets a default cache size based on the grid dimensions of a data set. The formula to calculate the default cache size is:  Xsize * Ysize * 8 * scale factor. The scale factor is 1.0 by default, but it can be changed by the user with the set cachesf command. The user can change the scale factor depending on the available memory on the system where GrADS is running. If available memory is limited, reduce the scale factor to a number less than 1.0. If memory is abundant, set the scale factor to a number greater than 1.0. Another way to override the default cache size is to use the CACHESIZE entry in the data descriptor file. This option is recommended if the data files have an especially large chunk size. In most cases, the default cache size set by GrADS should be satisfactory. The current value of a file's cache size may be discovered with the query cache command.
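As an illustration (the grid dimensions and descriptor file name below are hypothetical, and it is assumed that set cachesf takes the scale factor as its single argument), a 1440 x 721 grid gives a default cache size of 1440 * 721 * 8 * 1.0 = 8,305,920 bytes, or roughly 8 Mb. Adjusting the scale factor before opening the file might look like this:

    ga-> set cachesf 2.0          (double the default cache size for files opened after this)
    ga-> open mydata.ctl          (hypothetical descriptor file)
    ga-> query cache              (report the cache size now in effect for this file)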

Before you open a compressed netCDF file, you can find out what the chunk size is by using the ncdump utility with the -s option, or the h5dump utility with the -p option. These utilities are compiled along with the HDF5 and NetCDF-4 libraries, but they are not included in the GrADS distribution.
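For example, either of the following commands (the file name is hypothetical) will print the per-variable chunk sizes along with the rest of the file header:

    ncdump -h -s model.nc        (-h prints the header only; -s adds special attributes such as _ChunkSizes)
    h5dump -p -H model.nc        (-p prints storage properties, including chunking and compression filters; -H prints the header only)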

Reading Compressed Files
As long as GrADS has been compiled with netCDF library version 4.1 or higher (check q config to discover the version number), reading a compressed netCDF file works exactly the same way in GrADS as reading a "classic" netCDF file. The fact that the data are compressed should be nearly invisible to the user. The HDF5 interface also handles compressed files automatically.

For example, suppose you have a data set with a grid resolution of 0.10 degrees, with 3600 grid points in the X dimension and 1800 grid points in the Y dimension, and the chunk size is 360 x 180. The size of a single chunk is about 256 Kb, and there are 100 chunks in a global grid. GrADS will set the default cache size to 51840000 bytes, which is more than enough to keep all the chunks in a single global grid in the cache. This variable would be read in quickly. Subsequent displays of this same variable would be extremely fast because the data would be read entirely from memory. In many cases, the I/O for a compressed netCDF file will be faster than for a regular netCDF file because the time it takes to read the compressed chunks from disk and decompress them is less than the time it takes to read the non-compressed data from disk.
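Spelled out, the arithmetic behind this example is:

    size of one chunk:       360 * 180 * 4 bytes     =    259,200 bytes  (about 256 Kb)
    chunks per global grid:  (3600/360) * (1800/180) =        100 chunks
    default cache size:      3600 * 1800 * 8         = 51,840,000 bytes
    cached data per grid:    100 * 259,200           = 25,920,000 bytes  (fits easily in the cache)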

When a user opens a file and issues an initial I/O request, GrADS will compare the chunk size and the current cache size and make sure that at least one chunk will fit in the cache. If the chunks are too big, GrADS will issue a warning ("... The I/O for this variable will be extremely slow...") and show you the chunk and cache sizes. If you see this message, follow these instructions (a sketch of the command sequence appears after the list):
1. Close the file
2(a). Add a CACHESIZE entry in the data descriptor file with a suitable value, OR
2(b). Increase the cache size scale factor with the set cachesf command
3. Re-open the file
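A minimal sketch of this sequence at the GrADS command line, assuming the file is open as file number 1 (the descriptor file name and scale factor are hypothetical):

    ga-> close 1                 (step 1: close the file)
    ga-> set cachesf 4           (step 2b: increase the cache size scale factor)
    ga-> open bigchunks.ctl      (step 3: re-open the file)
    ga-> query cache             (confirm the new cache size)

For step 2(a), the alternative is to add a CACHESIZE entry with a suitably large value to the descriptor file before re-opening it.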

Writing Compressed NetCDF Files
You can create compressed netCDF output files with GrADS by using the sdfwrite command. To do this, use the -zip option with the set sdfwrite command, and then set the chunk size with the set chunksize command before invoking sdfwrite. More on how to set good chunk sizes is in the section below. The compressed netCDF files created with GrADS use zlib compression level 1. Higher compression levels are not recommended because they require more time to compress/decompress and do not add a significant reduction in file size.
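A minimal sketch of writing a compressed file follows. The file names, variable name, dimension settings, and the argument order for set chunksize are illustrative assumptions, not prescribed here; check the set sdfwrite and set chunksize entries in the command reference for the exact syntax.

    ga-> open model.ctl                   (hypothetical source data set)
    ga-> set lon 0 359.75                 (set the dimension environment for the grid to be written)
    ga-> set lat -90 90
    ga-> define p = precip                (sdfwrite writes a defined variable)
    ga-> set sdfwrite -zip precip.nc      (-zip requests compressed netCDF output)
    ga-> set chunksize 360 180            (assumed X,Y chunk dimensions; see the next section)
    ga-> sdfwrite p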

Setting Chunk Size
The chunk dimension sizes are set at the time the data are written out to file. Once the data are written, there is no way to change the chunk sizes except to copy the data to a new file. If you are creating a compressed netCDF file, be sure to set the chunk size carefully and keep in mind that other users of your data file may not have the same memory resources that you do. An estimate of the size of an uncompressed chunk is 4 bytes (for floating point data values) multiplied by all the chunk dimension sizes, plus a little more for metadata. It is recommended to keep the size of an uncompressed chunk in the ballpark of ~512 Kb.

The default behavior of GrADS is to set the chunk size equal to the variable's full grid size for the longitude (X) and latitude (Y) dimensions, and 1 (one) for all other dimensions. In this case, a chunk would be a single global 2-D lon/lat grid. However, if your data set is of sufficiently high resolution (e.g., if the grid spacing is less than 1.0 degree of latitude/longitude), then you should use the set chunksize command to set the chunk size smaller than the grid size in the longitude and latitude dimensions -- divide by 2, or 5, or 10 as necessary to keep the chunk in the ballpark of 512 Kb. Unless your data set does not vary in longitude and latitude, keep the chunk size equal to 1 for the level (Z), time (T), and ensemble (E) dimensions. If the chunk size is too big, then the cache can never be adequate to support the I/O at a reasonable speed. Chunks that are too small do a lot less harm than chunks that are too big.
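For example, for a hypothetical global grid at 0.25-degree resolution (1440 x 720 points), the default full-grid chunk would be too large, and dividing the horizontal dimensions by 4 brings the chunk into the recommended range:

    full-grid chunk:    1440 * 720 * 4  = 4,147,200 bytes  (about 4 Mb -- too big)
    divide X,Y by 2:     720 * 360 * 4  = 1,036,800 bytes  (about 1 Mb -- still large)
    divide X,Y by 4:     360 * 180 * 4  =   259,200 bytes  (about 256 Kb -- in the ballpark)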