File Input and Output

Serialized data stored in files on disk (or over the network) can be imported into PyGeode with the routines open() (for single files), openall() (for small numbers of files), and open_multi() (for large numbers of files). Variables and datasets can be saved to disk with save().

Several formats are supported natively by PyGeode, including NetCDF (versions 3 and 4), HDF (versions 4 and 5), and GRIB, though support for NetCDF and HDF is the most complete. By default the format is detected from the file extension; however, many of the methods below accept an optional format argument that specifies the format explicitly.

Format    Extension   String Identifier   Notes
-------   ---------   -----------------   -------------------------------
NetCDF    .nc         netcdf
HDF5                  netcdf              Uses the same library as NetCDF
NetCDF4   .nc         netcdf4             Requires the netCDF4 package.
HDF4      .hdf        hdf4
GRIB      .grib       grib

Plugins for other formats are available in add-on packages, such as pygeode-rpn for reading RPN Standard Files from Environment and Climate Change Canada.

Code also exists to read the native binary format used by the Canadian Centre for Climate Modelling and Analysis, but this is not distributed with PyGeode by default.

pygeode.open(filename, format=None, value_override={}, dimtypes={}, namemap={}, varlist=[], cfmeta=True, **kwargs)

Returns a Dataset containing the variables defined in a single file, or a dict of Dataset objects for a netcdf4 file containing groups.

Parameters
filename : string

Path of file to open

format : string, optional

String specifying format of file to open. If none is given, the format is automatically detected from the filename (see autodetectformat()).

value_override : dict, optional

A dictionary containing arrays with which to override values for one or more variables (specified by the keys). This can be used for instance to avoid loading the values of an axis whose values are severely scattered across a large file.

dimtypes : dict, optional

A dictionary mapping dimension names to axis classes. The keys should be axis names as defined in the file; values should be one of:

  1. an axis instance, which will be used directly

  2. an axis class, which will be used to create a new instance with the values given by the file

  3. a tuple of an axis class and a dictionary with keyword arguments to pass to that axis’ constructor

If dimtypes is not specified, an attempt is made to automatically identify the axis types (see optional cfmeta argument below)

namemap : dict, optional

A dictionary to map variable names as specified in the file (keys) to PyGeode variable names (values); also works for axes/dimensions

varlist : list, optional

A list (of strings) specifying the variables that should be loaded into the data set (if the list is empty, all NetCDF variables will be loaded)

cfmeta : boolean

If true, an attempt to identify the type of each dimension is made following the CF metadata conventions.

Returns
dataset

A dataset containing the variables in the file, or a dict of datasets. The variable data itself is not loaded into memory.

See also

openall
open_multi

Notes

The format of the file is automatically detected from the filename (if possible); otherwise it must be specified by the format argument. The identifiers used in varlist and dimtypes are the original names used in the NetCDF file, not the names given in namemap. The optional arguments are not currently supported for netcdf4 files containing groups.
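As a hedged illustration of the arguments described above, the following sketch opens a NetCDF file while renaming a variable and an axis. The variable names T and lat and their renamed targets are hypothetical; only open() and its format, namemap, varlist, and dimtypes arguments come from this page, and the availability of an axis class as pyg.Lat is an assumption about PyGeode's top-level namespace. Note that varlist and dimtypes are keyed by the names found in the file, not the renamed ones.

```python
# Hypothetical names: "T" and "lat" are assumed to exist in the file.
NAMEMAP = {"T": "Temp", "lat": "latitude"}  # name in file -> PyGeode name
VARLIST = ["T"]                             # original file names, not "Temp"


def open_renamed(path):
    """Open `path`, loading only the variables listed in VARLIST."""
    # Imported lazily so this module can be loaded without PyGeode installed.
    import pygeode as pyg

    return pyg.open(
        path,
        format="netcdf",            # give the format explicitly instead of detecting it
        namemap=NAMEMAP,
        varlist=VARLIST,
        dimtypes={"lat": pyg.Lat},  # keyed by the file's dimension name (assumes pyg.Lat)
    )
```

Because the variable data is not loaded when the file is opened, calling a function like this should be inexpensive even for large files.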

pygeode.openall(files, format=None, opener=None, **kwargs)

Returns a Dataset containing variables merged across multiple files.

Parameters
files : string, list, or tuple

Either a single filename or a list of filenames. Wildcards are supported; glob.iglob() is used to expand these into an explicit list of files.

format : string, optional

String specifying format of file to open. If none is given the format will be automatically detected from the first filename (see autodetectformat())

opener : function, optional

Function to open individual files. If none is provided, uses the format-specific version of open(). The datasets returned by this function are then concatenated and returned. See Notes.

sorted : boolean, optional

If True, the filenames are sorted alphabetically prior to opening each file, and the axes on the returned dataset are sorted by calling Dataset.sorted().

**kwargs : keyword arguments

These are passed on to the opener function.

Returns
dataset

A dataset containing the variables concatenated across all specified files. The variable data itself is not loaded into memory.

See also

open
open_multi

Notes

The function opener must take a single positional argument - the filename of the file to open - and keyword arguments that are passed through from this function. It must return a Dataset object with the loaded variables. By default the standard open() is used, but providing a custom opener can be useful for any reshaping of the variables that must be done prior to concatenating the whole dataset.

Once every file has been opened, the resulting datasets are concatenated using dataset.concat().

This function is best suited for a moderate number of files. Because each file must be opened to read its metadata, this can take a significant amount of time when many files are involved. In those cases open_multi() can be much more efficient, though it requires more coding effort initially. The concatenation underlying open_multi() is also more efficient when the data is actually accessed.
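The opener contract described in these notes (one positional filename, pass-through keyword arguments, a dataset returned) can be sketched with a small factory. This is only an illustration: make_opener, base_open, and transform are hypothetical names, not part of PyGeode.

```python
def make_opener(base_open, transform):
    """Build an opener that applies `transform` to each dataset after opening.

    `base_open` is any callable with the calling convention of open();
    `transform` takes the opened dataset and returns the reshaped dataset.
    """
    def opener(filename, **kwargs):
        # One positional argument (the filename); keyword arguments pass through.
        dataset = base_open(filename, **kwargs)
        return transform(dataset)

    return opener
```

With PyGeode installed, such an opener might be passed as openall(files, opener=make_opener(pygeode.open, fix_axes)), where fix_axes is a user-supplied function that reshapes each dataset before concatenation.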

pygeode.open_multi(files, format=None, opener=None, pattern=None, file2date=None, **kwargs)

Returns a Dataset containing variables merged across many files.

Parameters
files : string, list, or tuple

Either a single filename or a list of filenames. Wildcards are supported; glob.iglob() is used to expand these into an explicit list of files.

format : string, optional

String specifying format of file to open. If none is given the format will be automatically detected from the first filename (see autodetectformat())

opener : function, optional

Function to open individual files. If none is provided, uses the format-specific version of open(). The datasets returned by this function are then concatenated and returned. See Notes.

pattern : string, optional

A regex pattern to extract date stamps from the filename; used by the default file2date. Named groups in the pattern must be called <year>, <month>, <day>, <hour> or <minute>. Abbreviations are available: $Y matches a four-digit year, and $m, $d, $H, and $M match a two-digit month, day, hour and minute, respectively.

file2date : function, optional

Function which returns a date dictionary given a filename. By default this is produced by applying the regex pattern pattern to the filename.

sorted : boolean, optional

If True, the filenames are sorted alphabetically prior to opening each file, and the axes on the returned dataset are sorted by calling Dataset.sorted().

**kwargs : keyword arguments

These are passed on to the opener function.

Returns
dataset

A dataset containing the variables concatenated across all specified files. The variable data itself is not loaded into memory.

See also

open
openall

Notes

This is intended to provide access to large datasets whose files are separated by timestep. To avoid opening every file individually, only the first and the last file in the list are opened. These provide a template of which variables and which times are stored in each file; it is assumed that the number of timesteps (and their offsets) is the same across the whole dataset. The time axis is then constructed from the filenames themselves, using the function file2date to generate a date from each filename. As a result only two files need to be opened, which makes this a very efficient way to work with very large datasets.
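To make the filename-to-date step concrete, here is a minimal sketch of how a pattern with the abbreviations described under the pattern argument could be turned into a date dictionary. It is an illustration under assumptions, not PyGeode's implementation: the expansion table, expand_pattern, and the example filename are hypothetical.

```python
import re

# Hypothetical expansion of the documented abbreviations into named regex groups.
ABBREVIATIONS = {
    "$Y": r"(?P<year>[0-9]{4})",
    "$m": r"(?P<month>[0-9]{2})",
    "$d": r"(?P<day>[0-9]{2})",
    "$H": r"(?P<hour>[0-9]{2})",
    "$M": r"(?P<minute>[0-9]{2})",
}


def expand_pattern(pattern):
    """Replace each abbreviation in `pattern` with its named regex group."""
    for short, regex in ABBREVIATIONS.items():
        pattern = pattern.replace(short, regex)
    return pattern


def file2date(filename, pattern="$Y$m$d"):
    """Return a date dictionary extracted from `filename`."""
    match = re.search(expand_pattern(pattern), filename)
    if match is None:
        raise ValueError(f"No date stamp found in {filename!r}")
    # Keep only the groups that matched, converted to integers.
    return {key: int(value)
            for key, value in match.groupdict().items()
            if value is not None}


print(file2date("model_20050315.nc", pattern="_$Y$m$d"))
```

A custom function with the same calling convention (a filename in, a date dictionary out) could be supplied through the file2date argument when the date stamp cannot be described by a single pattern.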

However, no explicit check is made of the integrity of the files: if there are corrupt or missing data within individual files, this will not become clear until that data is actually accessed. The check can be run up front with check_multi(), which attempts to access all the data and returns a list of any problems encountered; this can take a long time, but it is a useful check (and is more likely to provide helpful error messages).

The function opener must take a single positional argument - the filename of the file to open - and keyword arguments that are passed through from this function. It must return a Dataset object with the loaded variables. By default the standard open() is used, but providing a custom opener can be useful for any reshaping of the variables that must be done prior to concatenating the whole dataset.

pygeode.save(filename, dataset, format=None, cfmeta=True, **kwargs)

Saves a Var or Dataset to file.

Parameters
filename : string

Path of file to save to.

dataset : Var, Dataset, collection of Var objects, or dict of Dataset objects

The dataset is consolidated using dataset.asdataset(). Dicts of Dataset objects are written as groups to netcdf4 files.

format : string, optional

String specifying format of file to write. If none is given, the format is automatically detected from the filename (see autodetectformat()).

cfmeta : boolean

If true, metadata is automatically written specifying the axis dimensions following CF metadata conventions.

Notes

The format of the file is automatically detected from the filename (if possible). The NetCDF format is at present the best supported.
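A brief, hedged usage sketch of save() follows. The wrapper names and the group names are hypothetical; the calls use only the arguments listed in the signature above.

```python
def save_as_netcdf(dataset, path):
    """Write a Var or Dataset to `path`, giving the format explicitly."""
    # Imported lazily so this module can be loaded without PyGeode installed.
    import pygeode as pyg

    pyg.save(path, dataset, format="netcdf", cfmeta=True)


def save_as_groups(datasets_by_group, path):
    """Write a dict of Dataset objects as groups of a netcdf4 file."""
    import pygeode as pyg

    # `datasets_by_group` maps hypothetical group names to Dataset objects.
    pyg.save(path, datasets_by_group, format="netcdf4")
```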

pygeode.formats.autodetectformat(filename)

Returns best guess at file format based on file name.

Parameters
filename : string

Filename to identify

Returns
string

String specifying identified file format.

Raises
ValueError

If the format cannot be determined from the extension.
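The behaviour described here can be sketched as a lookup on the file extension. This is a minimal illustration, not the library's code: the EXTENSIONS table is assembled from the formats listed on this page and is hypothetical, and PyGeode's own extdict may contain different or additional entries.

```python
import os

# Hypothetical extension table; PyGeode's own `extdict` may differ.
EXTENSIONS = {".nc": "netcdf", ".hdf": "hdf4", ".grib": "grib"}


def guess_format(filename):
    """Return a format identifier based on the extension of `filename`."""
    _, ext = os.path.splitext(filename)
    try:
        return EXTENSIONS[ext.lower()]
    except KeyError:
        raise ValueError(
            f"Cannot determine format from extension {ext!r} of {filename!r}"
        ) from None
```

When the extension is missing or ambiguous, passing the format argument explicitly to open() or save() avoids relying on detection at all.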

See also

extdict

pygeode.formats.multifile.check_multi(*args, **kwargs)

Validates the files for completeness and consistency with the assumptions made by pygeode.formats.multifile.open_multi.