File Input and Output
Serialized data stored in files on disk (or over the network) can be imported
into PyGeode with the routines open() (for single files), openall() (for small
numbers of files), and open_multi() (for large numbers of files). Variables and
datasets can be saved to disk with save().
Several formats are supported natively by PyGeode, including NetCDF (versions 3
and 4), HDF (versions 4 and 5) and GRIB files, though support for NetCDF and
HDF is the most complete. By default the format is detected from the file extension;
however, many of the methods below accept an optional format argument which can be
used to specify the format explicitly.
Format | Extension | String Identifier | Notes |
---|---|---|---|
NetCDF | .nc | netcdf | |
HDF5 | | netcdf | Uses the same library as NetCDF |
NetCDF4 | .nc | netcdf4 | Requires the netCDF4 package. |
HDF4 | .hdf | hdf4 | |
GRIB | .grib | grib | |
Plugins for other formats are available in add-on packages, such as pygeode-rpn for reading RPN Standard Files from Environment and Climate Change Canada.
Code also exists to read the native binary format used by the Canadian Centre for Climate Modelling and Analysis, but this is not distributed with PyGeode by default.
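The extension-based detection described above can be sketched as a simple lookup. This is only an illustration built from the table: `guess_format` and `EXTENSION_FORMATS` are hypothetical names, not part of PyGeode, and PyGeode's own autodetectformat() may resolve ambiguous cases differently. In particular, mapping .nc to netcdf rather than netcdf4 is an assumption made here.

```python
import os

# Extension-to-identifier pairs copied from the table above.
# Assumption: ".nc" resolves to "netcdf"; use format="netcdf4" to override.
# HDF5 is omitted because the table lists no extension for it.
EXTENSION_FORMATS = {
    ".nc": "netcdf",
    ".hdf": "hdf4",
    ".grib": "grib",
}


def guess_format(filename, format=None):
    """Hypothetical helper: return a format identifier for `filename`."""
    if format is not None:
        # An explicitly supplied identifier always takes precedence.
        return format
    ext = os.path.splitext(filename)[1].lower()
    if ext not in EXTENSION_FORMATS:
        raise ValueError(
            "cannot detect format of %r; pass format= explicitly" % filename
        )
    return EXTENSION_FORMATS[ext]
```

The point of the sketch is the precedence rule: an explicit format argument bypasses detection entirely, which is what makes it useful for files with unusual or missing extensions.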
- pygeode.open(filename, format=None, value_override={}, dimtypes={}, namemap={}, varlist=[], cfmeta=True, **kwargs)
Returns a Dataset containing variables defined in a single file, or a dict of Dataset objects for a netcdf4 file containing groups.
- Parameters
- filename : string
Path of file to open
- format : string, optional
String specifying format of file to open. If none is given the format will be automatically detected from the file (see autodetectformat()).
- value_override : dict, optional
A dictionary containing arrays with which to override values for one or more variables (specified by the keys). This can be used for instance to avoid loading the values of an axis whose values are severely scattered across a large file.
- dimtypes : dict, optional
A dictionary mapping dimension names to axis classes. The keys should be axis names as defined in the file; values should be one of:
  - an axis instance, which will be used directly
  - an axis class, which will be used to create a new instance with the values given by the file
  - a tuple of an axis class and a dictionary with keyword arguments to pass to that axis’ constructor
If dimtypes is not specified, an attempt is made to automatically identify the axis types (see the optional cfmeta argument below).
- namemap : dict, optional
A dictionary to map variable names as specified in the file (keys) to PyGeode variable names (values); also works for axes/dimensions
- varlist : list, optional
A list (of strings) specifying the variables that should be loaded into the dataset (if the list is empty, all NetCDF variables will be loaded)
- cfmeta : boolean
If true, an attempt to identify the type of each dimension is made following the CF metadata conventions.
- Returns
- dataset
A dataset containing the variables contained in the file or a dict of datasets. The variable data itself is not loaded into memory.
Notes
The format of the file is automatically detected from the filename (if possible); otherwise it must be specified by the format argument. The identifiers used in varlist and dimtypes are the original names used in the NetCDF file, not the names given in namemap. The optional arguments are not currently supported for netcdf4 files containing groups.
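The relationship between varlist and namemap noted above is easy to get wrong, so here is a minimal sketch of loading one variable and renaming it. The helper names and the example variable names are hypothetical; only pygeode.open() and its varlist and namemap arguments come from the documentation above.

```python
def build_open_kwargs(file_var, new_name):
    """Keyword arguments that load one variable and rename it on the way in."""
    return {
        # varlist uses the name as stored in the file, not the new name.
        "varlist": [file_var],
        # namemap maps the name in the file (key) to the PyGeode name (value).
        "namemap": {file_var: new_name},
    }


def open_renamed(path, file_var, new_name):
    """Open `path`, loading only `file_var` and exposing it as `new_name`."""
    # Imported lazily so build_open_kwargs() can be used without PyGeode.
    import pygeode

    return pygeode.open(path, **build_open_kwargs(file_var, new_name))
```

For a hypothetical file `model_output.nc` that stores a variable `T`, calling `open_renamed("model_output.nc", "T", "Temp")` would request `T` from the file and expose it under the name `Temp`.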
- pygeode.openall(files, format=None, opener=None, **kwargs)
Returns a Dataset containing variables merged across multiple files.
- Parameters
- files : string, list, or tuple
Either a single filename or a list of filenames. Wildcards are supported; glob.iglob() is used to expand these into an explicit list of files.
- format : string, optional
String specifying format of file to open. If none is given the format will be automatically detected from the first filename (see autodetectformat()).
- opener : function, optional
Function to open individual files. If none is provided, uses the format-specific version of open(). The datasets returned by this function are then concatenated and returned. See Notes.
- sorted : boolean, optional
If True, the filenames are sorted alphabetically prior to opening each file, and the axes on the returned dataset are sorted by calling Dataset.sorted().
- **kwargs : keyword arguments
These are passed on to the function opener.
- Returns
- dataset
A dataset containing the variables concatenated across all specified files. The variable data itself is not loaded into memory.
Notes
The function opener must take a single positional argument - the filename of the file to open - and keyword arguments that are passed through from this function. It must return a Dataset object with the loaded variables. By default the standard open() is used, but providing a custom opener can be useful for any reshaping of the variables that must be done prior to concatenating the whole dataset.
Once every file has been opened, the resulting datasets are concatenated using dataset.concat().
This function is best suited for a moderate number of files. Because each file must be explicitly opened to read the metadata, even this can take a significant amount of time if a large number of files are being opened. For these cases using open_multi() can be much more efficient, though it requires more coding effort initially. The underlying concatenation is also more efficient when the data is actually accessed.
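The opener contract described in these notes (one positional filename, pass-through keyword arguments, a Dataset as the return value) can be sketched with a small factory. `make_opener` is a hypothetical helper written for this illustration; it wraps any function that follows the calling convention of open() and supplies a default varlist.

```python
def make_opener(base_open, varlist):
    """Build an opener that restricts loading to `varlist` by default.

    `base_open` is any callable with the calling convention of
    pygeode.open(): a filename followed by keyword arguments.
    """

    def opener(filename, **kwargs):
        # Keep a caller-supplied varlist; otherwise fall back to the default.
        kwargs.setdefault("varlist", varlist)
        # Any reshaping of the returned dataset would go here, before return.
        return base_open(filename, **kwargs)

    return opener
```

Assuming PyGeode is installed and files matching the hypothetical pattern `run_*.nc` exist, this could be used as `pygeode.openall("run_*.nc", opener=make_opener(pygeode.open, ["T"]))`.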
- pygeode.open_multi(files, format=None, opener=None, pattern=None, file2date=None, **kwargs)
Returns a Dataset containing variables merged across many files.
- Parameters
- files : string, list, or tuple
Either a single filename or a list of filenames. Wildcards are supported; glob.iglob() is used to expand these into an explicit list of files.
- format : string, optional
String specifying format of file to open. If none is given the format will be automatically detected from the first filename (see autodetectformat()).
- opener : function, optional
Function to open individual files. If none is provided, uses the format-specific version of open(). The datasets returned by this function are then concatenated and returned. See Notes.
- pattern : string, optional
A regex pattern to extract date stamps from the filename; used by the default file2date. Matching groups must be named <year>, <month>, <day>, <hour> or <minute>. Abbreviations are available for the above: $Y matches a four-digit year, and $m, $d, $H, and $M match a two-digit month, day, hour and minute, respectively.
- file2date : function, optional
Function which returns a date dictionary given a filename. By default this is produced by applying the regex pattern to the filename.
- sorted : boolean, optional
If True, the filenames are sorted alphabetically prior to opening each file, and the axes on the returned dataset are sorted by calling Dataset.sorted().
- **kwargs : keyword arguments
These are passed on to the function opener.
- Returns
- dataset
A dataset containing the variables concatenated across all specified files. The variable data itself is not loaded into memory.
Notes
This is intended to provide access to large datasets whose files are separated by timestep. To avoid opening every file individually, the time axis is constructed by opening the first and the last file in the list of files provided. This is done to provide a template of what variables and what times are stored in each file - it is assumed that the number of timesteps (and their offsets) is the same across the whole dataset. The time axis is then constructed from the filenames themselves, using the function file2date to generate a date from each filename. As a result only two files need to be opened, which makes this a very efficient way to work with very large datasets.
However, no explicit check is made of the integrity of the files - if there are corrupt or missing data within individual files, this will not become clear until that data is actually accessed. Integrity can be checked explicitly with check_multi(), which attempts to access all the data and returns a list of any problems encountered; this can take a long time, but is a useful check (and is more likely to provide helpful error messages).
The function opener must take a single positional argument - the filename of the file to open - and keyword arguments that are passed through from this function. It must return a Dataset object with the loaded variables. By default the standard open() is used, but providing a custom opener can be useful for any reshaping of the variables that must be done prior to concatenating the whole dataset.
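To make the pattern and file2date parameters more concrete, the following sketch expands the $Y, $m, $d, $H and $M abbreviations into named regex groups and returns a date dictionary for a filename. It is not PyGeode's implementation: `make_file2date` and `ABBREVIATIONS` are hypothetical names, and the exact form of the date dictionary PyGeode expects (here, integer values keyed by group name) is an assumption.

```python
import re

# Abbreviations described in the pattern parameter, as named regex groups.
ABBREVIATIONS = {
    "$Y": r"(?P<year>[0-9]{4})",
    "$m": r"(?P<month>[0-9]{2})",
    "$d": r"(?P<day>[0-9]{2})",
    "$H": r"(?P<hour>[0-9]{2})",
    "$M": r"(?P<minute>[0-9]{2})",
}


def make_file2date(pattern):
    """Hypothetical helper: build a file2date function from `pattern`."""
    # Expand the abbreviations before compiling, so "$" is never seen
    # by the regex engine as an end-of-string anchor.
    for short, group in ABBREVIATIONS.items():
        pattern = pattern.replace(short, group)
    regex = re.compile(pattern)

    def file2date(filename):
        match = regex.search(filename)
        if match is None:
            raise ValueError("no date stamp found in %r" % filename)
        # Assumption: the date dictionary holds integers keyed by group name.
        return {k: int(v) for k, v in match.groupdict().items() if v is not None}

    return file2date
```

With the hypothetical pattern `r"run_$Y$m$d\.nc"`, the returned function maps `run_20100315.nc` to a dictionary with year 2010, month 3 and day 15. A function of this shape could then be passed as the file2date argument when the default pattern handling does not fit the file naming scheme.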
- pygeode.save(filename, dataset, format=None, cfmeta=True, **kwargs)
Saves a Var or Dataset to file.
- Parameters
- filename : string
Path of file to save to.
- dataset : Var, Dataset, collection of Var objects, or dict of Dataset objects
The dataset is consolidated using dataset.asdataset(). Dicts of Dataset objects are written as groups to netcdf4 files.
- format : string, optional
String specifying format of file to save. If none is given the format will be automatically detected from the filename (see autodetectformat()).
- cfmeta : boolean
If true, metadata is automatically written specifying the axis dimensions following CF metadata conventions.
Notes
The format of the file is automatically detected from the filename (if possible). The NetCDF format is at present the best supported.