Working with Datasets

PyGeode Var objects can be grouped together into Dataset objects. This is convenient when dealing with, for instance, the contents of a large NetCDF file that defines multiple variables. Datasets can also be used to perform bulk operations on all (or some) of the variables they contain, which can be quite powerful.

As a simple example, let’s return to the second dataset in the tutorial module, and select a single timestep:

In [1]: import pygeode as pyg, numpy as np

In [2]: from pygeode.tutorial import t2

In [3]: print(t2(time='1 Sep 2010'))
<Dataset>:
Vars:
  Temp (time,pres,lat,lon)  (1,20,31,60)
  U    (time,pres,lat,lon)  (1,20,31,60)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

As you can see, this returns a new dataset with the appropriate selection applied to each variable it contains. Many operations defined for single variables have equivalent versions that act on whole datasets:

In [4]: import pygeode as pyg

In [5]: from pygeode.tutorial import t2

In [6]: print(t2.mean('time'))
<Dataset>:
Vars:
  Temp (pres,lat,lon)  (20,31,60)
  U    (pres,lat,lon)  (20,31,60)
Axes:
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

In [7]: print(t2.transpose('time', 'lon', 'lat', 'pres'))
<Dataset>:
Vars:
  Temp (time,lon,lat,pres)  (3650,60,31,20)
  U    (time,lon,lat,pres)  (3650,60,31,20)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  lon <Lon>      :  0 E to 354 E (60 values)
  lat <Lat>      :  90 S to 90 N (31 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

In [8]: print(t2.extend(0, pyg.NamedAxis(name = 'member', values=np.arange(5))))
<Dataset>:
Vars:
  Temp (member,time,pres,lat,lon)  (5,3650,20,31,60)
  U    (member,time,pres,lat,lon)  (5,3650,20,31,60)
Axes:
  member <NamedAxis 'member'>:  0  to 4  (5 values)
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

If you need to perform a custom operation, or a more complicated sequence of operations, you can do that as well. Write a function that takes a variable from the dataset as its first argument and returns a new, modified variable, then apply it to every variable in the dataset using Dataset.map(). As a simple example, consider the following:

In [9]: def sel(v, lat=0):
   ...:   return v(s_lat=lat).rename(v.name + '_' + v.lat.formatvalue(lat, '%dN'))
   ...: 

In [10]: t_eq = t2.map(sel); print(t_eq)
<Dataset>:
Vars:
  Temp_EQ (time,pres,lon)  (3650,20,60)
  U_EQ    (time,pres,lon)  (3650,20,60)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

# You can also pass additional arguments, either by keyword
In [11]: t_5s = t2.map(sel, lat=-5); print(t_5s)
<Dataset>:
Vars:
  Temp_5S (time,pres,lon)  (3650,20,60)
  U_5S    (time,pres,lon)  (3650,20,60)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

# or as a positional argument
In [12]: t_5n = t2.map(sel, 5); print(t_5n)
<Dataset>:
Vars:
  Temp_5N (time,pres,lon)  (3650,20,60)
  U_5N    (time,pres,lon)  (3650,20,60)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

In more complicated datasets, this can be very useful for operating only on a subset of the variables (for instance, only those)
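
The way Dataset.map() hands extra arguments on to the mapped function can be pictured with a small pure-Python stand-in. The sketch below does not use PyGeode and is not its implementation; ToyVar, toy_map, and scale are hypothetical names introduced only for illustration, and a plain list of named tuples stands in for a dataset.

```python
from collections import namedtuple

# Hypothetical stand-in for a variable: just a name and some values.
ToyVar = namedtuple('ToyVar', ['name', 'values'])

def toy_map(variables, func, *args, **kwargs):
    """Call func on each variable, forwarding any extra arguments."""
    return [func(v, *args, **kwargs) for v in variables]

def scale(v, factor=1):
    """Return a new, renamed variable whose values are multiplied by factor."""
    return ToyVar(v.name + '_x%d' % factor, [factor * x for x in v.values])

toy = [ToyVar('Temp', [1, 2, 3]), ToyVar('U', [4, 5, 6])]

by_default = toy_map(toy, scale)             # factor keeps its default of 1
by_keyword = toy_map(toy, scale, factor=2)   # extra argument given by keyword
by_position = toy_map(toy, scale, 3)         # extra argument given positionally

for result in (by_default, by_keyword, by_position):
    print([v.name for v in result])
```

As in the sel example, the mapped function receives the variable first, and anything else given to the mapping call is passed straight through, whether by keyword or by position.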

One can then combine these datasets, or rename the variables they contain:

In [13]: print(t_5s + t_5n)
<Dataset>:
Vars:
  Temp_5S (time,pres,lon)  (3650,20,60)
  U_5S    (time,pres,lon)  (3650,20,60)
  Temp_5N (time,pres,lon)  (3650,20,60)
  U_5N    (time,pres,lon)  (3650,20,60)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

In [14]: print(t_5n.rename_vars(Temp_5N = 'T_5n'))
<Dataset>:
Vars:
  T_5n (time,pres,lon)  (3650,20,60)
  U_5N (time,pres,lon)  (3650,20,60)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode
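
As a rough mental model, combining two datasets gathers their variables into one collection, and rename_vars() relabels selected variables through old=new keywords. The following pure-Python sketch mimics that with dictionaries. toy_combine and toy_rename_vars are hypothetical helpers, not PyGeode functions, and the choice to reject duplicate names is an assumption of the sketch rather than a statement about how PyGeode behaves.

```python
def toy_combine(a, b):
    """Gather the variables of two toy datasets into a new one."""
    clashes = set(a) & set(b)
    if clashes:
        # Assumption of this sketch only: refuse duplicate variable names.
        raise ValueError('duplicate variable names: %s' % sorted(clashes))
    merged = dict(a)
    merged.update(b)
    return merged

def toy_rename_vars(dataset, **namemap):
    """Return a copy with variables renamed according to old=new keywords."""
    return {namemap.get(name, name): values for name, values in dataset.items()}

south = {'Temp_5S': [1.0, 2.0], 'U_5S': [3.0, 4.0]}
north = {'Temp_5N': [5.0, 6.0], 'U_5N': [7.0, 8.0]}

both = toy_combine(south, north)
renamed = toy_rename_vars(north, Temp_5N='T_5n')

print(list(both))
print(list(renamed))
```

Both helpers return new dictionaries and leave their inputs untouched, which mirrors how the examples above print a new dataset rather than modifying t_5n in place.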