Getting Started¶

We’ll start by taking a look at a couple of synthetic datasets defined in the tutorial module of PyGeode. The first such dataset can be imported as follows:

In [1]: import pygeode as pyg

In [2]: from pygeode.tutorial import t1

In [3]: print(t1)
<Dataset>:
Vars:
  Temp (lat,lon)  (31,60)
Axes:
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature data generated by pygeode

The dataset t1 contains one pygeode variable object, Temp, which is defined on a two dimensional grid of 31 latitudes and 60 longitudes. The grid is defined by the two pygeode axes objects, lat and lon, which span from 85 S to 85 N, and from 0 E to 354 E. respectively.

We can take a closer look at our variable Temp:

In [4]: print(t1.Temp)
<Var 'Temp'>:
  Units: K  Shape:  (lat,lon)  (31,60)
  Axes:
    lat <Lat>      :  90 S to 90 N (31 values)
    lon <Lon>      :  0 E to 354 E (60 values)
  Attributes:
    {}
  Type:  Add_Var (dtype="float64")

Again, we see both physical and numerical details of the grid on which this variable is defined, as well that the temperatures are defined in degrees Kelvin.

Finally, we can look as well at one of our axes:

In [5]: print(t1.lat)
lat <Lat>      :  90 S to 90 N (31 values)

which we could also access as a member of the variable Temp: t1.Temp.lat.

These three types of objects, variables, axes, and datasets, are the core of PyGeode. Variables (Var) encapsulate a scalar field such as temperature or one component of the wind field. Axes (Axis), which are in fact special cases of variables, define the geophysical grid on which the scalar field exists. Finally, datasets (Dataset) are containers for variables. These concepts match reasonably closely to their counterparts in NetCDF files (and indeed PyGeode works very naturally with NetCDF files). We’ll learn more about all three soon, but to get a quick feel of what PyGeode can do, let’s try plotting the contents of our first variable:

# Depending on what python interpreter you are using, you may need to run
# these commands to access the plotting functionality of matplotlib
In [6]: import pylab as pyl; pyl.ion();

In [7]: pyg.showvar(t1.Temp)
Out[7]: <pygeode.plot.wrappers.AxesWrapper at 0x7f37362c5070>

PyGeode annotates the axes automatically, and if the Cartopy package is installed, data on a lat-lon grid is plotted over the outline of the continents using a cylindrical projection. All plots generated by PyGeode are in fact generated using the matplotlib library; the plotvar command simply makes many educated guesses about the type of plot you might want to generate based on the variable you’ve passed in. Many aspects of the plot can be customized through the plotvar interface (which we’ll get into later on in this tutorial), but if you find you can’t tweak things just so, the plot can always be manipulated using the matplotlib library itself which gives you full control.

Now let’s take a look at a second dataset:

In [8]: from pygeode.tutorial import t2

In [9]: print(t2)
<Dataset>:
Vars:
  Temp (time,pres,lat,lon)  (3650,20,31,60)
  U    (time,pres,lat,lon)  (3650,20,31,60)
Axes:
  time <ModelTime365>:  Jan 1, 2011 00:00:00 to Dec 31, 2020 00:00:00 (3650 values)
  pres <Pres>    :  1000 hPa to 50 hPa (20 values)
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
  history        : Synthetic Temperature and Wind data generated by pygeode

This is a somewhat more complicated temperature field, now defined on four dimensions: time, pressure, latitude and longitude. Note this grid has over 350 million data points–enough that the 2010-era laptop this tutorial was first written on would complain if we attempted to load all of it into memory at once. This brings us to one of the major guiding assumptions of pygeode - that since we haven’t done anything that needs the data itself, none of it has yet been loaded or computed. We can, nonetheless, manipulate this variable as if it has:

In [10]: Tc = pyg.climatology(t2.Temp)    # Compute the climatology

In [11]: Tcz = Tc.mean('lon')        # Compute the zonal mean

In [12]: print(Tcz)
<Var 'Temp_clim_mean'>:
  Shape:  (time,pres,lat)  (365,20,31)
  Axes:
    time <ModelTime365>:  Jan 1,  00:00:00 to Dec 31,  00:00:00 (365 values)
    pres <Pres>    :  1000 hPa to 50 hPa (20 values)
    lat <Lat>      :  90 S to 90 N (31 values)
  Attributes:
    {}
  Type:  MeanVar (dtype="float64")

In [13]: Tcp = Tc - Tcz              # Compute anomaly from the zonal mean

In [14]: Tcp = Tcp.rename("T'")      # Rename variable

In [15]: print(Tcp)
<Var 'T''>:
  Shape:  (time,pres,lat,lon)  (365,20,31,60)
  Axes:
    time <ModelTime365>:  Jan 1,  00:00:00 to Dec 31,  00:00:00 (365 values)
    pres <Pres>    :  1000 hPa to 50 hPa (20 values)
    lat <Lat>      :  90 S to 90 N (31 values)
    lon <Lon>      :  0 E to 354 E (60 values)
  Attributes:
    {}
  Type:  RenamedVar (dtype="float64")

We’ve now computed (at least an abstract sense) the climatolgical anomaly from the zonal mean. Note that Tcz is defined on a reduced grid (we’ve lost the Lon axis), and it looks and behaves just like any other PyGeode variable; PyGeode has some automatic logic for properly aligning Tc and Tcz when taking their difference. However, no data has yet been loaded, no actual averages have been taken, and no differences computed, since we haven’t done anything that needs the data. If we now go and plot the near-surface, high-latitude anomaly of our synthetic dataset:

In [16]: pyg.showvar(Tcp(pres=1000, time='15 Jun 00'))
Out[16]: <pygeode.plot.wrappers.AxesWrapper at 0x7f37341da700>

We now actually need to perform these operations (though only on one pressure level and one latitude), so PyGeode goes back and accesses the required data (only), then carries out all of our operations before sending the data off to matplotlib to produce a nice contour plot.

Implied in the operations above is another principle of PyGeode. Note that nowhere did we need to remember which dimension of an array corresponded to time and which corresponded to longitude; nor did we need to know what index described the 1000 hPa pressure level, or which corresponded to the 15th of June. Moreover, we could take the difference between two fields (Tc and Tcz) which weren’t defined on the same grid–PyGeode takes care of the broadcasting and alignment of the underlying numpy arrays for us, leaving us to think about the problem we’re trying to solve in the physical coordinate space of our data. The underlying mapping is of course still there, and if it’s ever more convenient to think in the space of the data arrays, you are still free to do so; the relevant syntax and commands are described later in this tutorial.

‘This is all fine and good,’ you may be thinking, ‘but how do I work with my own data?’ PyGeode supports a number of data formats, though it works most naturally with NetCDF (or HDF5) files. As an example, the synthetic dataset used above can be written out to a NetCDF file (this assumes that a subdirectory called ‘sample_data’ already exists in your working directory):

In [17]: pyg.save('sample_data/file.nc', t1)

and then read back in using:

In [18]: ds = pyg.open('sample_data/file.nc')

In [19]: print(ds)
<Dataset>:
Vars:
  Temp (lat,lon)  (31,60)
Axes:
  lat <Lat>      :  90 S to 90 N (31 values)
  lon <Lon>      :  0 E to 354 E (60 values)
Global Attributes:
{}

PyGeode is aware of the metadata in the file and constructs variables and axes accordingly, following the CF metadata standard used for climate data. Datasets that span multiple files can be handled as well, and one can tailor these calls so that PyGeode properly recognizes your data regardless of the metadata in the file.

This concludes the introductory tutorial. We’ve seen the three basic objects in PyGeode: variables (Var), axes (Axis) and datasets (Dataset). We’ve seen examples of how PyGeode carries out all operations in a ‘lazy’ fashion, delaying all loading and processing of data until the output is explicitly required. We’ve seen a few examples of the automatic annotation of plots that PyGeode takes care of for you, and finally we’ve seen how to save and load data to a file. The next part of these tutorials gives a deeper introduction to the Basic Operations that one can perform on PyGeode variables.