Variable input and output

Reading from a single file

PyGeode was intended for dealing with large gridded datasets - nearly always such datasets will be serialized on disk, sometimes in a single file, but often spread over many files. PyGeode supports a number of different formats, and the commands for loading and saving data to disk are to a large extent independent of the format used, and attempts are made to automatically detect which format you are working with. This being said, NetCDF files are the most completely supported. We’ll start by looking at reading a single file.

The basics

The most basic form of reading a single file from disk is to simply call open(). We’ll open the file written out at the end of the Getting Started section of the tutorial:

In [1]: import pygeode as pyg, numpy as np

In [2]: ds = pyg.open('sample_data/file.nc')

In [3]: print(ds)
<Dataset>:
Vars:
  Temp (lat,lon)  (31,60)
Axes:
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
{}

As you can see, this returns a Dataset object with the contents of the file. The format of the file has been automatically detected. In this case the netcdf file was generated by an eariler call to save(), which fills in metadata recognized by PyGeode that indicate, for instance, what kind of PyGeode axis each NetCDF dimension coresponds to. Let’s take a look at the contents of this particular NetCDF file to get a quick sense of what metadata PyGeode recognizes and how it affects the dataset returned.

$ ncdump -h file.nc
netcdf file {
dimensions:
  lat = 31 ;
  lon = 60 ;
variables:
  double lat(lat) ;
    lat:units = "degrees_north" ;
    lat:standard_name = "latitude" ;
    lat:ancillary_variables = "lat_weights" ;
    lat:axis = "Y" ;
  double lon(lon) ;
    lon:units = "degrees_east" ;
    lon:standard_name = "longitude" ;
    lon:axis = "X" ;
  double Temp(lat, lon) ;
  double lat_weights(lat) ;

// global attributes:
    :history = "Synthetic Temperature data generated by pygeode" ;
}

We can see there is a set of metadata (NetCDF attributes) which PyGeode has interpreted, recognizing, for instance, that the NetCDF variables lat and lon represent coordinate axes, and should be treated as PyGeode Lat and Lon axes¸ respectively. Because these are coordinate axes, they are treated as (possibly child classes of) Axis objects, and not the more generic Var objects. The NetCDF variable Temp becomes a PyGeode variable, defined on the axes lat and lon.

There is one final NetCDF variable, lat_weights. This is referenced in the metadata of the lat coordinate variable, as an ancillary variable defining a set of weights for the latitude axis. PyGeode recognizes this and loads it into the weights member of the Lat axis, which is recognized when, for instance, taking averages (mean()) over the latitude axis.

In [4]: print(ds.lat.weights)
[6.12323400e-17 1.04528463e-01 2.07911691e-01 3.09016994e-01
 4.06736643e-01 5.00000000e-01 5.87785252e-01 6.69130606e-01
 7.43144825e-01 8.09016994e-01 8.66025404e-01 9.13545458e-01
 9.51056516e-01 9.78147601e-01 9.94521895e-01 1.00000000e+00
 9.94521895e-01 9.78147601e-01 9.51056516e-01 9.13545458e-01
 8.66025404e-01 8.09016994e-01 7.43144825e-01 6.69130606e-01
 5.87785252e-01 5.00000000e-01 4.06736643e-01 3.09016994e-01
 2.07911691e-01 1.04528463e-01 6.12323400e-17]

Overriding metadata

All of this is convenient when one has a well-formed NetCDF file, however, one often encounters rather bare NetCDF files in the wild, with either no or (even worse) incorrect metadata. In this case PyGeode will not be able to recognize what kind of Axis it should associate with each each axis:

In [5]: from pygeode.tutorial import t1

In [6]: pyg.save('sample_data/file_nometa.nc', t1, cfmeta=False)

In [7]: ds2 = pyg.open('sample_data/file_nometa.nc')

In [8]: print(ds2)
<Dataset>:
Vars:
  Temp (lat,lon)  (31,60)
Axes:
  lat <NamedAxis 'lat'>:  -90  to 90  (31 values)
  lon <NamedAxis 'lon'>:  0  to 354  (60 values)
Global Attributes:
{}


In [9]: pyg.showvar(ds2.Temp)
Out[9]: <pygeode.plot.wrappers.AxesWrapper at 0x7f37340173a0>
_images/t1Temp_nometadata.png

The lat and lon axes are now the default type NamedAxes. While this is still a perfectly useable PyGeode dataset, PyGeode no longer knows how to properly annotate and format plots.

However, if you know what the contents of the file represent, you can provide this information to PyGeode when opening it by passing special keyword arguments to open(). The most important of this is the dimtypes argument, which takes a dictionary whose keys are the names of the axes in the file being opened, and whose values tell PyGeode what kind of Axis class to associate with that dimension. This is perhaps best illustrated by example:

In [10]: dt = dict(lat = pyg.Lat, lon = pyg.regularlon(60))

In [11]: print(pyg.open('sample_data/file_nometa.nc', dimtypes=dt))
<Dataset>:
Vars:
  Temp (lat,lon)  (31,60)
Axes:
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
{}

We can see that now the NamedAxis objects have been replaced by the appropriate axis types, Lat and Lon, respectively. You will notice we have used two different ways of specifying the axes types. In the case of lat, we have simply passed the PyGeode class itself. In this case, PyGeode reads the values for the lat dimension from the file and uses these to create a new instance of Lat.

In contrast, for the lon axis, we have created an instance of the Lon class (with a helper function regularlon()) prior to opening the file. In this case, PyGeode simply uses the already-created axis, and avoids reading in the coordinates for the axis being overridden. Of course, the length of the axis must mach the length of the dimension in the file. This is useful for when the data in the file is not correct, but it can also happen that in very large files, depending on the details of how the file data has been structured, even just reading the coordinate metadata can be quite a slow operation. If you regularly work with such a file, passing in an already created axis instance can speed up call to open the file considerably.

There is one more form that can be used in the dimtypes dictionary, for when one wants to construct an axis using data from the file, but when the axis type in question requires additional arguments to be properly constructed. To give a contrived example, consider the following code (which is unfortunately a bit opaque):

In [12]: kwargs = dict(units='days', startdate = dict(year=2001, month=1, day=1))

In [13]: dt2 = dict(lon = (pyg.StandardTime, kwargs))

In [14]: print(pyg.open('sample_data/file_nometa.nc', dimtypes=dt2))
<Dataset>:
Vars:
  Temp (lat,time)  (31,60)
Axes:
  lat <NamedAxis 'lat'>:  -90  to 90  (31 values)
  time <StandardTime>:  Jan 1, 2001 00:00:00 to Dec 21, 2001 00:00:00 (60 values)
Global Attributes:
{}

Here we have decided to treat the lon axis describing a time axis, but for some reason we wish to use the coordinate data as actual time information. As will be explained in more detail in a later section (Working with Axes), time axes generally need a bit more information when they are constructed to define the calendar. We need give these additional arguments (the units and startdate items in the kwargs dictionary) which we pass in along with the class object itself. PyGeode then calls the constructor of the StandardTime with the values read from the file and with the additional keyword arguments to create a new axis object, which is then assigned to the lon dimension of any variable.

There are a few other arguments recognized by open(), but we’ll just mention one more, perhaps the second most useful: namemap. This allows you to simply rename variables and axes:

In [15]: nm = dict(lon = 'Longitude', Temp='Temperature')

In [16]: print(pyg.open('sample_data/file_nometa.nc', namemap=nm, dimtypes=dt))
<Dataset>:
Vars:
  Temperature (lat,Longitude)  (31,60)
Axes:
  lat <Lat>      :  90 S to 90 N (31 values)
  Longitude <Lon>:  0 E to 354 E (60 values)
Global Attributes:
{}

As you can see, the original axes names as defined in the file are still used to replace axis types in the dimtypes dictionary; the axis is renamed after it has been replaced by the newly specified axis class.

Reading from multiple files

PyGeode is intended for use with very large datasets, which are often distributed across many files. It has two additional methods, openall() and open_multi(), which are intended for use with such datasets. The first, which is the simpler of the two, is better suited for moderately large datasets which consist of multiple files, but not a large number of files. This works by opening each file explicitly, then merging the contents of each file into a single PyGeode dataset. Since each file is explicitly opened and its metadata read in, this can add up for datasets consisting of a large number of files. A subtler issue is that the concatenation PyGeode uses in this case is also, at present, somewhat less efficient than it could be. As a result, accessing the data opened with this method is not as fast as the second method.

The second, open_multi(), is somewhat more complicated to use, but is better suited for very large datasets, particularly in which the dataset is separated along the time axis. In this case PyGeode simply opens the first and last file in the dataset, reads in their metadata, then infers the contents of the rest of the files given an addition (user-defined) function that maps filenames to dates. Since only two files are opened, this is a very efficient operation even for extremely large datasets. As was just mentioned, the data is also loaded more efficiently than from datasets opened with the first method.

To demonstrate these two methods, we’ll need a sample dataset to work with. To keep the dataset small, we’ll write out the zonal mean of the temperature field from the second tutorial dataset, one year per file

In [17]: from pygeode.tutorial import t2

In [18]: for y in range(2011, 2021):
   ....:   pyg.save('sample_data/temp_zm_y%d.nc' % y, t2.Temp(year=y).mean('lon'))
   ....: 

This produces ten files, each with a year of data in them.

To demonstrate openall(), we can simply provide a wildcard filename which matches the filenames:

In [19]: ds = pyg.openall('sample_data/temp_zm_*.nc', namemap=dict(Temp='T'))

In [20]: print(ds.T)
<Var 'T'>:
  Shape:  (time,pres,lat)  (3650,20,31)
  Axes:
    time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
    pres <Pres>    :  1000 hPa to 50 hPa (20 values)
    lat <Lat>      :  90 S to 90 N (31 values)
  Attributes:
    {}
  Type:  SortedVar (dtype="float64")

PyGeode expands the wildcard, opens each of the files, then concatenates the datasets. The result is a single dataset object with 10 years of temperature data that you can use without worrying further about how the data is distributed on disk. The filenames can alternatively be specified as a list; in fact in this case PyGeode expands each item in the list if there are wildcards (using glob.glob()).

The same arguments (dimtypes and namemap) apply; here we’ve renamed the variable Temp to T. In addition, if you need to do more sophisticated manipulations of the data in each file before PyGeode concatenates the individual datasets, you can provide a function that takes the filename (as a string) as a single argument, and returns the modified dataset. This can be useful, for instance, for correcting issues in individual files:

In [21]: def opener(f):
   ....:   ds = pyg.open(f)
   ....:   if '2016' in f: # Replace the data in this file with a dummy value
   ....:     return pyg.Dataset([ds.Temp * 0 + 250.])
   ....:   else:
   ....:     return ds
   ....: 

In [22]: ds = pyg.openall('sample_data/temp_zm_*.nc', opener=opener)

In [23]: pyg.showvar(ds.Temp(lat=15, pres=500))
Out[23]: <pygeode.plot.wrappers.AxesWrapper at 0x7f373411aca0>
_images/t1Temp_modified.png

Using open_multi() is in most respects very similar. The main difference is that in addition to specifying the filenames, this method assumes that the dataset is divided along the time axis, and you need to provide a way for PyGeode to map each filename to a date.

In simple cases this can be done by simply specifying a regular expression that matches the components of the date in the filename. For instance:

In [24]: patt = 'temp_zm_y(?P<year>[0-9]{4}).nc'

In [25]: print(pyg.open_multi('sample_data/temp_zm_*.nc', pattern=patt))
<Dataset>:
Vars:
  Temp (time,pres,lat)  (3650,20,31)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lat <Lat>      :  90 S to 90 N (31 values)
Global Attributes:
{}

The regular expression matches the four digit year in the filenames in a way that PyGeode can understand. Since this four digit format is commonly encountered, there is an abbreviation for it you can use which saves you the trouble of remembering how Python regular expressions work:

In [26]: patt = 'temp_zm_y$Y.nc'

In [27]: print(pyg.open_multi('sample_data/temp_zm_*.nc', pattern=patt))
<Dataset>:
Vars:
  Temp (time,pres,lat)  (3650,20,31)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lat <Lat>      :  90 S to 90 N (31 values)
Global Attributes:
{}

Similar abbreviations exist for 2-digit month, day, hour, and minute fields (see open_multi() for details), though if the filenames you’re working with aren’t in the appropriate format, you’ll have to provide the full regular expression.

In some cases the filenames themselves don’t quite have enough information, or else it’s difficult to write an appropriate regular expression to parse out the right information. As an alternative, you can also specify a Python function that takes the filename as an argument, and returns the corresponding date dictionary. As a simple example,

In [28]: def f2d(fn):
   ....:   date = dict(month=1, day=1)
   ....:   date['year'] = int(fn[-7:-3]) # Extract the year from the filename and convert to an integer
   ....:   return date
   ....: 

In [29]: print(pyg.open_multi('sample_data/temp_zm_*.nc', file2date = f2d))
<Dataset>:
Vars:
  Temp (time,pres,lat)  (3650,20,31)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lat <Lat>      :  90 S to 90 N (31 values)
Global Attributes:
{}

Note that in the interests of speed, PyGeode only opens the first and last file in the dataset, and then infers the contents of the remainder of the files. The assumption is made that the contents of each file are identical, only translated in time. If there are errors or missing data in the interior files, these will not be detected when opening the dataset. It is good practice when opening such a dataset for the first time to explicitly load in at least some of the data from each file in the dataset to confirm that all the files are well-formed. There is a helper function for this purpose, pygeode.formats.multifile.check_multi().

Saving to files

In contrast to reading files, saving PyGeode variables and datasets to disk is typically a much simpler operation, and we’ve seen examples of this already in a couple places in this tutorial. The save() method takes a filename and either a single Var object, or a Dataset object (in fact it uses the helper function asdataset() which recognizes various other kinds of collections of variables as well). As with reading files, writing to NetCDF files is the best supported option. By default, PyGeode writes NetCDF version 3 format files for greater compatibility, but this version has a number of limitations, including a maximum file size. Alternatively, version 4 format files can be written by specifying

In [30]: pyg.save('sample_data/file_v4.nc', t1, version=4)

which permits larger files to be written. There are several other options, including built-in compression (this is a feature of the NetCDF 4 library), and packing, which can be done in any format. See save() for details.

If you have a large dataset you wish to write out in multiple files, you can simply select the subset you wish before saving the variable or dataset to disk.