NSSDC Canonical File

By Don Sawyer and John Garrett

Motivation

When NSSDC decided to move beyond the vendor-specific VMS/Files-11 management system used by NDADS, it needed to convert the NDADS files to a vendor-independent form while retaining all the information necessary to handle each file properly or to recreate a faithful replica of it in a VMS/Files-11 management system. The vendor-independent form had to accommodate information from a variety of file systems, and it had to be easy to move among different file systems without significant information loss. In other words, NSSDC needed a type of 'canonical file' to meet these criteria.

Requirements

The types of VMS files in the NDADS system were examined to determine the extent of the VMS problem. Of the three possible types of VMS file organization (sequential, relative, and indexed), only sequential files were present. This greatly reduced the range of VMS attributes needed to understand and use the files. The VMS record formats were then examined, and it was found that fixed-length, variable-length, and stream formats were present, while the variable-length with a control field format was absent. Of the four types of VMS record control, referred to as none, carriage_control, FORTRAN, and print, only print was absent. In addition, file types were both ASCII and binary. This implied that the main VMS issue to overcome was maintaining record boundary information in non-VMS environments.

It was also determined that the generation of the appropriate canonical file form needed to be based on the data and underlying file system supporting that data so that creating the canonical form could be readily automated and widely applicable. Further, changes to the original file structure should be minimized, and the resulting canonical file form should be maximally usable in a variety of operating systems with common utilities.

Design Decisions

The major issue in moving the NSSDC VMS files to a canonical form is to capture information on record boundaries that was previously known and maintained by the VMS file system, but not within the file data stream itself. For binary files with variable-length records, it was decided to insert an "NSSDC maintained" record separator into the data stream at record boundaries, in the form of a prefixed byte count, forming an NSSDC canonical file. The insertion is documented in a separate, but associated, NSSDC attribute object so that the original data stream can be recovered as needed. The association is provided by an implementation of an AIP. The alternative, creating another data object to record the byte positions of the record boundaries, was not taken because it is more complex and always requires the use of two files to obtain the data on a record basis. With the current approach the canonical form carries the record boundary information directly. This analysis and decision-making process were led by Don Sawyer and Bob Candey.

Some of the VMS 7-bit ASCII files with fixed record lengths, and those with variable record lengths, also needed the insertion of an "NSSDC maintained" record separator. This led to the definition of four NSSDC canonical forms, labeled "A", "B", "C", and "D", and defined as follows:

1. Canonical Form A: Data are binary and there are no NSSDC Archive maintained record separators

2. Canonical Form B: Data are binary and there are NSSDC Archive maintained record separators (currently a 2-byte count)

3. Canonical Form C: Data are 7-bit ASCII and there are no NSSDC Archive maintained record separators

4. Canonical Form D: Data are 7-bit ASCII and there are NSSDC Archive maintained record separators (currently CR/LF)

As can be seen, forms A and B are used for binary data streams. Binary data streams are modified only when they have variable-length records, in which case a 2-byte unsigned integer field, formatted big-endian, is inserted as a prefix to each record. Forms C and D are used for 7-bit ASCII data streams. ASCII streams may need a record delimiter inserted after each record, whether fixed or variable length, depending on the VMS attributes. The record delimiter chosen is the carriage-return line-feed pair because it was felt that this would give some indication of record boundaries when the files are used with common utilities on a variety of platforms. Note that all canonical forms, including the ASCII forms, are to be transferred among systems using binary transfer protocols to ensure that no bytes are unintentionally converted.
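The record separators described above can be illustrated with a minimal Python sketch. It is not NSSDC's conversion software. The function names are hypothetical, and the sketch assumes that the 2-byte count holds the length of the record data only, excluding the count field itself, which the text does not state explicitly.

```python
import struct

CRLF = b"\r\n"  # Form D record delimiter (carriage return, line feed)


def encode_form_b(records):
    """Prefix each binary record with a 2-byte unsigned big-endian count.

    Assumption: the count is the length of the record data alone.
    """
    out = bytearray()
    for rec in records:
        if len(rec) > 0xFFFF:
            raise ValueError("record too long for a 2-byte count")
        out += struct.pack(">H", len(rec))  # ">H" = big-endian unsigned 16-bit
        out += rec
    return bytes(out)


def decode_form_b(stream):
    """Recover the original records from a Form B byte stream."""
    records = []
    pos = 0
    while pos < len(stream):
        if pos + 2 > len(stream):
            raise ValueError("truncated record count")
        (count,) = struct.unpack_from(">H", stream, pos)
        pos += 2
        if pos + count > len(stream):
            raise ValueError("truncated record data")
        records.append(stream[pos:pos + count])
        pos += count
    return records


def encode_form_d(records):
    """Append a CR/LF delimiter after each 7-bit ASCII record."""
    return b"".join(rec + CRLF for rec in records)
```

Under these assumptions, encode_form_b([b"abc"]) yields the five bytes 00 03 61 62 63, and decode_form_b reverses the operation, which is the kind of recovery of the original data stream that the attribute object is meant to make possible.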

The following table gives the valid VMS file attribute combinations for NSSDC's NDADS data, and their resulting mapping to a canonical form.

File Type   Data Type     Rec Format   Rec Control   Canonical Form
    1       7-bit ASCII   Fixed        None          C
    2       7-bit ASCII   Fixed        CC            D
    3       7-bit ASCII   Fixed        Fortran       D
    4       7-bit ASCII   Stream_LF    CC            D
    5       7-bit ASCII   Undefined    None          C
    6       7-bit ASCII   Variable     None          D
    7       7-bit ASCII   Variable     CC            D
    8       7-bit ASCII   Variable     Fortran       D
    9       Binary        Fixed        None          A
   10       Binary        Undefined    None          A
   11       Binary        Variable     None          B

Table 1. Mapping of VMS File Types to NSSDC Canonical Forms

Note: All files are 'Sequential', with record sizes up to 32767 bytes.
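Because the mapping in Table 1 depends only on the data type, record format, and record control, it lends itself to automation as a simple lookup. The following Python sketch shows one possible way to express it. The dictionary and function names are hypothetical, and the lowercase key spellings are a convention chosen for this example rather than anything defined by NSSDC.

```python
# Table 1 as a lookup: (data type, record format, record control) -> form.
CANONICAL_FORMS = {
    ("7-bit ascii", "fixed", "none"): "C",
    ("7-bit ascii", "fixed", "cc"): "D",
    ("7-bit ascii", "fixed", "fortran"): "D",
    ("7-bit ascii", "stream_lf", "cc"): "D",
    ("7-bit ascii", "undefined", "none"): "C",
    ("7-bit ascii", "variable", "none"): "D",
    ("7-bit ascii", "variable", "cc"): "D",
    ("7-bit ascii", "variable", "fortran"): "D",
    ("binary", "fixed", "none"): "A",
    ("binary", "undefined", "none"): "A",
    ("binary", "variable", "none"): "B",
}


def canonical_form(data_type, rec_format, rec_control):
    """Return the canonical form letter for a VMS attribute combination.

    Raises ValueError for combinations that do not appear in Table 1.
    """
    key = (
        data_type.strip().lower(),
        rec_format.strip().lower(),
        rec_control.strip().lower(),
    )
    try:
        return CANONICAL_FORMS[key]
    except KeyError:
        raise ValueError(f"combination not listed in Table 1: {key}") from None
```

For example, canonical_form("Binary", "Variable", "None") returns "B". Rejecting unlisted combinations, rather than guessing a form, keeps the sketch limited to what the table actually states.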

When files are generated on non-VMS platforms, such as UNIX, and are to be put into AIPs, information on record boundaries is not carried by the underlying file system. In this case the canonical forms used will be A and C for binary and 7-bit ASCII respectively, which means that there is no change between the original data stream and the canonical form.

Curator: Natalie Barnes
Responsible Official: Dr. Joseph H. King, Code 633
Last Revised: [NAB]