Working with Datasets
PyGeode Var objects can be grouped together into Dataset objects. This is
convenient when dealing with, for instance, the contents of a large NetCDF
file that defines multiple variables. Datasets can also be used to perform
bulk operations on all (or some) of the variables they contain, which can be
quite powerful.
As a simple example, let’s return to the second dataset in the tutorial module, and select a single timestep:
In [1]: import pygeode as pyg, numpy as np
In [2]: from pygeode.tutorial import t2
In [3]: print(t2(time='1 Sep 2010'))
<Dataset>:
Vars:
Temp (time,pres,lat,lon) (1,20,31,60)
U (time,pres,lat,lon) (1,20,31,60)
Axes:
time <ModelTime365>: Jan 1, 2011 00:00:00
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lat <Lat> : 90 S to 90 N (31 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
As you can see, this returns a new dataset with the appropriate selection applied to each variable it contains. Many operations defined for single variables have equivalent versions that act on whole datasets:
In [4]: import pygeode as pyg
In [5]: from pygeode.tutorial import t2
In [6]: print(t2.mean('time'))
<Dataset>:
Vars:
Temp (pres,lat,lon) (20,31,60)
U (pres,lat,lon) (20,31,60)
Axes:
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lat <Lat> : 90 S to 90 N (31 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
In [7]: print(t2.transpose('time', 'lon', 'lat', 'pres'))
<Dataset>:
Vars:
Temp (time,lon,lat,pres) (3650,60,31,20)
U (time,lon,lat,pres) (3650,60,31,20)
Axes:
time <ModelTime365>: Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
lon <Lon> : 0 E to 354 E (60 values)
lat <Lat> : 90 S to 90 N (31 values)
pres <Pres> : 1000 hPa to 50 hPa (20 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
In [8]: print(t2.extend(0, pyg.NamedAxis(name = 'member', values=np.arange(5))))
<Dataset>:
Vars:
Temp (member,time,pres,lat,lon) (5,3650,20,31,60)
U (member,time,pres,lat,lon) (5,3650,20,31,60)
Axes:
member <NamedAxis 'member'>: 0 to 4 (5 values)
time <ModelTime365>: Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lat <Lat> : 90 S to 90 N (31 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
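The pattern in the examples above (a dataset-level call producing a new dataset in which every variable has had the same operation applied) can be pictured with a small pure-Python sketch. ToyVar and ToyDataset are hypothetical stand-ins invented for illustration; they only track names, axes and shapes, and they are not PyGeode's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyVar:
    """Hypothetical stand-in for a variable: a name, named axes, and their lengths."""
    name: str
    axes: tuple
    shape: tuple

    def mean(self, axis):
        # Averaging over an axis removes that axis from the variable.
        keep = [i for i, a in enumerate(self.axes) if a != axis]
        return ToyVar(self.name,
                      tuple(self.axes[i] for i in keep),
                      tuple(self.shape[i] for i in keep))

    def transpose(self, *order):
        # Reorder the axes (and the matching lengths) by name.
        idx = [self.axes.index(a) for a in order]
        return ToyVar(self.name, tuple(order), tuple(self.shape[i] for i in idx))


class ToyDataset:
    """Hypothetical container that forwards each operation to every variable."""

    def __init__(self, variables):
        self.vars = list(variables)

    def mean(self, axis):
        return ToyDataset(v.mean(axis) for v in self.vars)

    def transpose(self, *order):
        return ToyDataset(v.transpose(*order) for v in self.vars)


full = ('time', 'pres', 'lat', 'lon')
toy = ToyDataset([ToyVar('Temp', full, (3650, 20, 31, 60)),
                  ToyVar('U', full, (3650, 20, 31, 60))])

toy_mean = toy.mean('time')
toy_t = toy.transpose('time', 'lon', 'lat', 'pres')
for v in toy_mean.vars + toy_t.vars:
    print(v.name, v.axes, v.shape)
```

The axes and shapes printed by the sketch mirror those in the t2.mean('time') and t2.transpose(...) outputs above, which is the only behaviour it is meant to illustrate.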
If you need to perform a custom operation, or a more complicated set of
operations, you can do that as well. Write a function that takes a variable
from the dataset as its first argument and returns a new, modified variable,
then apply it to every variable using Dataset.map().
As a simple example, consider the following:
In [9]: def sel(v, lat=0):
   ...:     return v(s_lat=lat).rename(v.name + '_' + v.lat.formatvalue(lat, '%dN'))
   ...:
In [10]: t_eq = t2.map(sel); print(t_eq)
<Dataset>:
Vars:
Temp_EQ (time,pres,lon) (3650,20,60)
U_EQ (time,pres,lon) (3650,20,60)
Axes:
time <ModelTime365>: Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
# You can also pass additional arguments, either by keyword
In [11]: t_5s = t2.map(sel, lat=-5); print(t_5s)
<Dataset>:
Vars:
Temp_5S (time,pres,lon) (3650,20,60)
U_5S (time,pres,lon) (3650,20,60)
Axes:
time <ModelTime365>: Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
# or as a positional argument
In [12]: t_5n = t2.map(sel, 5); print(t_5n)
<Dataset>:
Vars:
Temp_5N (time,pres,lon) (3650,20,60)
U_5N (time,pres,lon) (3650,20,60)
Axes:
time <ModelTime365>: Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
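The way extra arguments reach the mapped function in the three calls above can be sketched without PyGeode. In this sketch, map_vars, toy_sel and lat_label are hypothetical helpers, and variables are plain dictionaries; the only point is that anything passed after the function, positionally or by keyword, is forwarded to it after the variable itself.

```python
def lat_label(lat):
    """Format a latitude like the names in the outputs above: EQ, 5S, 5N."""
    if lat == 0:
        return 'EQ'
    return f"{abs(lat)}{'N' if lat > 0 else 'S'}"


def toy_sel(v, lat=0):
    """Mimic sel(): drop the lat axis and tag the name with the chosen latitude."""
    axes = tuple(a for a in v['axes'] if a != 'lat')
    return {'name': f"{v['name']}_{lat_label(lat)}", 'axes': axes}


def map_vars(variables, f, *args, **kwargs):
    """Call f(v, *args, **kwargs) on each variable and collect the results."""
    return [f(v, *args, **kwargs) for v in variables]


full = ('time', 'pres', 'lat', 'lon')
toy_vars = [{'name': 'Temp', 'axes': full}, {'name': 'U', 'axes': full}]

names_eq = [v['name'] for v in map_vars(toy_vars, toy_sel)]          # default lat=0
names_5s = [v['name'] for v in map_vars(toy_vars, toy_sel, lat=-5)]  # keyword
names_5n = [v['name'] for v in map_vars(toy_vars, toy_sel, 5)]       # positional
print(names_eq, names_5s, names_5n)
```

Keeping the variable as the first parameter of the mapped function is what lets the same function serve all three calls.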
In more complicated datasets, this can be very useful for operating on only a subset of the variables.
One can then combine these datasets:
In [13]: print(t_5s + t_5n)
<Dataset>:
Vars:
Temp_5S (time,pres,lon) (3650,20,60)
U_5S (time,pres,lon) (3650,20,60)
Temp_5N (time,pres,lon) (3650,20,60)
U_5N (time,pres,lon) (3650,20,60)
Axes:
time <ModelTime365>: Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
Variables in a dataset can also be renamed with rename_vars():
In [14]: print(t_5n.rename_vars(Temp_5N = 'T_5n'))
<Dataset>:
Vars:
T_5n (time,pres,lon) (3650,20,60)
U_5N (time,pres,lon) (3650,20,60)
Axes:
time <ModelTime365>: Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
pres <Pres> : 1000 hPa to 50 hPa (20 values)
lon <Lon> : 0 E to 354 E (60 values)
Global Attributes:
history : Synthetic Temperature and Wind data generated by pygeode
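As a rough mental model of the last two outputs, a dataset can be pictured as an ordered mapping from variable names to variables. The sketch below uses plain dictionaries and two hypothetical helpers, combine and rename_vars_toy; it is not how PyGeode implements these operations, and rejecting duplicate names is a choice made for this sketch only.

```python
def combine(*datasets):
    """Concatenate the variables of several toy datasets, keeping their order."""
    merged = {}
    for ds in datasets:
        for name, axes in ds.items():
            if name in merged:
                # Assumption of this sketch: clashing names are an error.
                raise ValueError(f"duplicate variable name: {name}")
            merged[name] = axes
    return merged


def rename_vars_toy(ds, **new_names):
    """Return a copy of ds with the selected variables renamed, order preserved."""
    unknown = set(new_names) - set(ds)
    if unknown:
        raise KeyError(f"no such variable(s): {sorted(unknown)}")
    return {new_names.get(name, name): axes for name, axes in ds.items()}


axes = ('time', 'pres', 'lon')
toy_5s = {'Temp_5S': axes, 'U_5S': axes}
toy_5n = {'Temp_5N': axes, 'U_5N': axes}

both = combine(toy_5s, toy_5n)
renamed = rename_vars_toy(toy_5n, Temp_5N='T_5n')
print(list(both))
print(list(renamed))
```

The variable order in both results matches the order listed under Vars in the corresponding outputs above.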