File Input and Output

Serialized data stored in files on disk (or over the network) can be imported into PyGeode with the routines open() (for single files), openall() (for small numbers of files), and open_multi() (for large numbers of files). Variables and datasets can be saved to disk with save().

Several formats are supported natively by PyGeode, including NetCDF (versions 3 and 4), HDF (versions 4 and 5) and GRIB, though support for NetCDF and HDF is the most complete. By default the format is detected from the file extension; however, many of the routines below accept an optional format argument which can be used to specify the format explicitly.

Format	Extension	String Identifier	Notes
NetCDF	.nc	netcdf
HDF5		netcdf	Uses the same library as NetCDF
NetCDF4	.nc	netcdf4	Requires the netCDF4 package.
HDF4	.hdf	hdf4
GRIB	.grib	grib

Plugins for other formats are available in add-on packages, such as pygeode-rpn for reading RPN Standard Files from Environment and Climate Change Canada.

Code also exists to read the native binary format used by the Canadian Centre for Climate Modeling and Analysis, but this is not distributed with PyGeode by default.

pygeode.open(filename, format=None, value_override={}, dimtypes={}, namemap={}, varlist=[], cfmeta=True, **kwargs)

Returns a Dataset containing the variables defined in a single file, or a dict of Dataset objects for a netcdf4 file containing groups.

Parameters

filename (string)

Path of file to open

format (string, optional)

String specifying format of file to open. If none is given the format will be automatically detected from the file (see autodetectformat())

value_override (dict, optional)

A dictionary containing arrays with which to override values for one or more variables (specified by the keys). This can be used for instance to avoid loading the values of an axis whose values are severely scattered across a large file.

dimtypes (dict, optional)

A dictionary mapping dimension names to axis classes. The keys should be axis names as defined in the file; values should be one of:

an axis instance, which will be used directly
an axis class, which will be used to create a new instance with the values given by the file
a tuple of an axis class and a dictionary with keyword arguments to pass to that axis’ constructor

If dimtypes is not specified, an attempt is made to automatically identify the axis types (see optional cfmeta argument below)

namemap (dict, optional)

A dictionary to map variable names as specified in the file (keys) to PyGeode variable names (values); also works for axes/dimensions

varlist (list, optional)

A list (of strings) specifying the variables that should be loaded into the data set (if the list is empty, all NetCDF variables will be loaded)

cfmeta (boolean)

If true, an attempt to identify the type of each dimension is made following the CF metadata conventions.

Returns

dataset: A dataset containing the variables in the file, or a dict of datasets. The variable data itself is not loaded into memory.

See also

openall
open_multi

Notes

The format of the file is automatically detected from the filename (if possible); otherwise it must be specified by the format argument. The identifiers used in varlist and dimtypes are the original names used in the NetCDF file, not the names given in namemap. The optional arguments are not currently supported for netcdf4 files containing groups.
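As a rough illustration of the point about names, the sketch below selects and renames variables in one call. The file variable names 'T' and 'U', the new names, and the helper open_subset are hypothetical; only the open() signature shown above is taken from this page. PyGeode is imported inside the function so the snippet loads even where the package is not installed.

```python
# Hypothetical names as they are stored in the file.
FILE_VARS = ['T', 'U']

# Keys are the names in the file, values are the names PyGeode should use.
NAMEMAP = {'T': 'Temp', 'U': 'Zonal_Wind'}


def open_subset(path):
    """Open `path`, loading only FILE_VARS and renaming them via NAMEMAP."""
    import pygeode as pyg  # lazy import: assumes PyGeode is installed

    # varlist uses the original file names ('T', 'U'), not the renamed ones.
    return pyg.open(path, format='netcdf', varlist=FILE_VARS, namemap=NAMEMAP)
```

Passing 'Temp' in varlist instead of 'T' would refer to a name that does not exist in the file.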

pygeode.openall(files, format=None, opener=None, **kwargs)

Returns a Dataset containing variables merged across multiple files.

Parameters

files (string, list, or tuple): Either a single filename or a list of filenames. Wildcards are supported; glob.iglob() is used to expand these into an explicit list of files.
format (string, optional): String specifying format of file to open. If none is given, the format will be automatically detected from the first filename (see autodetectformat()).
opener (function, optional): Function to open individual files. If none is provided, uses the format-specific version of open(). The datasets returned by this function are then concatenated and returned. See Notes.
sorted (boolean, optional): If True, the filenames are sorted alphabetically before each file is opened, and the axes on the returned dataset are sorted by calling Dataset.sorted().
**kwargs (keyword arguments): These are passed on to the function opener.
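The wildcard expansion described for files can be pictured with a small standalone sketch. The helper expand_files is hypothetical and is not PyGeode's implementation; it only shows what expanding with glob.iglob() and sorting alphabetically produces. The temporary files exist purely for the demonstration.

```python
import glob
import os
import tempfile


def expand_files(files, sort=True):
    """Expand a filename, a wildcard, or a list of either into explicit paths."""
    if isinstance(files, str):
        files = [files]
    expanded = []
    for entry in files:
        # glob.iglob yields every existing path that matches the wildcard.
        expanded.extend(glob.iglob(entry))
    return sorted(expanded) if sort else expanded


# Demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('run_02.nc', 'run_01.nc', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()
    matches = expand_files(os.path.join(tmp, 'run_*.nc'))
    found = [os.path.basename(path) for path in matches]

print(found)
```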

Returns

dataset: A dataset containing the variables concatenated across all specified files. The variable data itself is not loaded into memory.

See also

open
open_multi

Notes

The function opener must take a single positional argument - the filename of the file to open - and keyword arguments that are passed through from this function. It must return a Dataset object with the loaded variables. By default the standard open() is used, but providing a custom opener can be useful for any reshaping of the variables that must be done prior to concatenating the whole dataset.

Once every file has been opened, the resulting datasets are concatenated using dataset.concat().

This function is best suited for a moderate number of files. Because each file must be explicitly opened to read the metadata, even this can take a significant amount of time if a large number of files are being opened. For these cases using open_multi() can be much more efficient, though it requires more coding effort initially. The underlying concatenation is also more efficient when the data is actually accessed.
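A custom opener along the lines described in these notes might look like the sketch below. The factory make_opener is a hypothetical helper; the underlying open function is passed in as an argument so the example does not depend on PyGeode being installed.

```python
def make_opener(open_func, varlist):
    """Build an opener that loads only `varlist` from each file.

    `open_func` is expected to behave like pygeode.open(): it takes a
    filename plus keyword arguments and returns a Dataset.
    """
    def opener(filename, **kwargs):
        # One positional argument (the filename) and pass-through keywords,
        # matching the contract described in the notes above.
        return open_func(filename, varlist=varlist, **kwargs)
    return opener
```

With PyGeode available, this could be used as pygeode.openall('run_*.nc', opener=make_opener(pygeode.open, ['T'])), where the wildcard and the variable name 'T' are placeholders. Note that passing varlist again through **kwargs would raise a TypeError in this sketch.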

pygeode.open_multi(files, format=None, opener=None, pattern=None, file2date=None, **kwargs)

Returns a Dataset containing variables merged across many files.

Parameters

files (string, list, or tuple): Either a single filename or a list of filenames. Wildcards are supported; glob.iglob() is used to expand these into an explicit list of files.
format (string, optional): String specifying format of file to open. If none is given, the format will be automatically detected from the first filename (see autodetectformat()).
opener (function, optional): Function to open individual files. If none is provided, uses the format-specific version of open(). The datasets returned by this function are then concatenated and returned. See Notes.
pattern (string, optional): A regex pattern to extract date stamps from the filename; used by the default file2date. Matching groups must be named <year>, <month>, <day>, <hour> or <minute>. Abbreviations are available for the above: $Y matches a four-digit year, and $m, $d, $H, and $M match a two-digit month, day, hour and minute, respectively.
file2date (function, optional): Function which returns a date dictionary given a filename. By default this is produced by applying the regex pattern to the filename.
sorted (boolean, optional): If True, the filenames are sorted alphabetically before each file is opened, and the axes on the returned dataset are sorted by calling Dataset.sorted().
**kwargs (keyword arguments): These are passed on to the function opener.
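To make the pattern abbreviations concrete, here is one way they could be expanded into named regex groups and turned into a date dictionary. This is a sketch and not PyGeode's own code: the ABBREVIATIONS table, expand_pattern, and file2date_from_pattern are all hypothetical names, and returning integer values is an assumption.

```python
import re

# Assumed expansions, following the description of $Y, $m, $d, $H and $M.
ABBREVIATIONS = {
    '$Y': r'(?P<year>[0-9]{4})',
    '$m': r'(?P<month>[0-9]{2})',
    '$d': r'(?P<day>[0-9]{2})',
    '$H': r'(?P<hour>[0-9]{2})',
    '$M': r'(?P<minute>[0-9]{2})',
}


def expand_pattern(pattern):
    """Replace each abbreviation with its named regex group."""
    for short, regex in ABBREVIATIONS.items():
        pattern = pattern.replace(short, regex)
    return pattern


def file2date_from_pattern(pattern):
    """Return a function mapping a filename to a date dictionary."""
    compiled = re.compile(expand_pattern(pattern))

    def file2date(filename):
        match = compiled.search(filename)
        if match is None:
            raise ValueError('no date stamp found in %r' % filename)
        # Keep only the groups that actually matched, converted to integers.
        return {key: int(value)
                for key, value in match.groupdict().items()
                if value is not None}

    return file2date
```

For example, file2date_from_pattern(r'run_$Y$m$d\.nc') applied to 'run_20010215.nc' yields the year 2001, month 2 and day 15.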

Returns

dataset: A dataset containing the variables concatenated across all specified files. The variable data itself is not loaded into memory.

See also

open
openall

Notes

This is intended to provide access to large datasets whose files are separated by timestep. To avoid opening every file individually, the time axis is constructed by opening the first and the last file in the list of files provided. This is done to provide a template of what variables and what times are stored in each file; it is assumed that the number of timesteps (and their offsets) is the same across the whole dataset. The time axis is then constructed from the filenames themselves, using the function file2date to generate a date from each filename. As a result only two files need to be opened, which makes this a very efficient way to work with very large datasets.
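When a filename encodes its date in a form a simple regex cannot express directly, a hand-written file2date can be supplied instead. The sketch below assumes hypothetical filenames such as model_2001_032.nc that carry a year and a day of year. The choice of dictionary keys mirrors the group names used by pattern; which keys PyGeode actually requires is an assumption here.

```python
import datetime
import os
import re

# Hypothetical naming scheme: <prefix>_<4-digit year>_<3-digit day of year>.nc
_STAMP = re.compile(r'(?P<year>[0-9]{4})_(?P<doy>[0-9]{3})')


def file2date(filename):
    """Convert a year/day-of-year stamp in `filename` to a date dictionary."""
    match = _STAMP.search(os.path.basename(filename))
    if match is None:
        raise ValueError('no date stamp found in %r' % filename)
    year = int(match.group('year'))
    day_of_year = int(match.group('doy'))
    # Day 1 is 1 January, so offset by (day_of_year - 1) days.
    date = datetime.date(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)
    return {'year': date.year, 'month': date.month, 'day': date.day}
```

Such a function could then be passed as pygeode.open_multi('model_*.nc', file2date=file2date), with the wildcard standing in for the real file list.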

However, no explicit check is made of the integrity of the files: if there are corrupt or missing data within individual files, this will not become clear until that data is actually accessed. Such a check can be run with check_multi(), which attempts to access all the data and returns a list of any problems encountered. This can take a long time, but it is a useful check (and is more likely to provide helpful error messages).

pygeode.save(filename, dataset, format=None, cfmeta=True, **kwargs)

Saves a Var or Dataset to file.

Parameters

filename (string): Path of file to save to.
dataset (Var, Dataset, collection of Var objects, or dict of Dataset objects): The input is consolidated using dataset.asdataset(). Dicts of Dataset objects are written as groups to netcdf4 files.
format (string, optional): String specifying the format of the file to write. If none is given, the format will be automatically detected from the filename (see autodetectformat()).
cfmeta (boolean): If true, metadata is automatically written specifying the axis dimensions following CF metadata conventions.

Notes

The format of the file is automatically detected from the filename (if possible). The NetCDF format is at present the best supported.
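A minimal round trip through save() and open() might be sketched as follows. The helper save_and_reopen is hypothetical and relies only on the two signatures documented on this page; PyGeode is imported inside the function so the snippet loads without the package installed.

```python
def save_and_reopen(dataset, path, format='netcdf'):
    """Write `dataset` to `path`, then reopen the file that was written."""
    import pygeode as pyg  # lazy import: assumes PyGeode is installed

    # Giving the format explicitly avoids relying on the file extension.
    pyg.save(path, dataset, format=format)
    return pyg.open(path, format=format)
```

Reopening the file is a simple way to confirm that the variables and axes were written as expected.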

pygeode.formats.autodetectformat(filename)

Returns best guess at file format based on file name.

Parameters

filename (string): Filename to identify

Returns

string: String specifying identified file format.

Raises

ValueError: If the format cannot be determined from the extension.

See also

extdict
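Extension-based detection of this kind can be sketched with a plain dictionary lookup. The mapping below is assembled from the format table at the top of this page and is not PyGeode's actual extdict; the function name detect_format, the choice of 'netcdf' for .nc, and the case-insensitive matching are assumptions.

```python
import os

# Assumed mapping, based on the format table on this page.
EXT_TO_FORMAT = {
    '.nc': 'netcdf',
    '.hdf': 'hdf4',
    '.grib': 'grib',
}


def detect_format(filename):
    """Return the format identifier implied by the extension of `filename`."""
    extension = os.path.splitext(filename)[1].lower()
    try:
        return EXT_TO_FORMAT[extension]
    except KeyError:
        raise ValueError(
            'cannot determine format from extension %r' % extension) from None
```

A file whose extension is not in the mapping raises ValueError, in which case the format would have to be given explicitly.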

pygeode.formats.multifile.check_multi(*args, **kwargs)

Validates the files for completeness and consistency with the assumptions made by pygeode.formats.multifile.open_multi.