Stats module
- pygeode.correlate(X, Y, axes=None, output='r,p', pbar=None)
Computes Pearson correlation coefficient between variables X and Y.
- Parameters
- X, Y : Var
Variables to correlate. Must have at least one axis in common.
- axes : list, optional
Axes over which to compute the correlation; if nothing is specified, the correlation is computed over all axes common to X and Y.
- output : string, optional
A string determining which parameters are returned; see list of possible outputs in the Returns section. The specifications must be separated by a comma. Defaults to ‘r,p’.
- pbar : progress bar, optional
A progress bar object. If nothing is provided, a progress bar will be displayed if the calculation takes sufficiently long.
- Returns
- results : Dataset
The names of the variables match the output request string (e.g. if ds is the returned dataset, the correlation coefficient can be obtained through ds.r). The following quantities can be returned:
‘r’: The Pearson correlation coefficient \(\rho_{XY}\)
‘r2’: The coefficient of determination \(\rho^2_{XY}\)
‘p’: The p-value; see notes.
Notes
The coefficient \(\rho_{XY}\) is computed following von Storch and Zwiers 1999, section 8.2.2. The p-value is the two-sided probability of finding a correlation coefficient of equal or greater magnitude than the given result under the hypothesis that the true correlation coefficient between X and Y is zero. It is computed from the t-statistic given in eq (8.7), in section 8.2.3, and assumes normally distributed quantities.
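As an illustration of the quantities involved, the following is a minimal pure-Python sketch of the sample Pearson coefficient and the textbook t-statistic for the zero-correlation hypothesis, \(t = r\sqrt{(n-2)/(1-r^2)}\). It does not call pygeode; the helper names `pearson_r` and `correlation_t` are hypothetical, and the assumption that this t-statistic has the same form as eq (8.7) has not been checked against the reference.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def correlation_t(r, n):
    """t-statistic for H0: true correlation is zero (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)          # corresponds to the 'r' output
r2 = r * r                   # corresponds to the 'r2' output
t = correlation_t(r, len(x))
print(r, r2, t)
```

The two-sided p-value would then follow from the Student t distribution with n - 2 degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.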
- pygeode.regress(X, Y, axes=None, N_fac=None, output='m,b,p', pbar=None)
Computes least-squares linear regression of Y against X.
- Parameters
- X, Y : Var
Variables to regress. Must have at least one axis in common.
- axes : list, optional
Axes over which to compute the regression; if nothing is specified, the regression is computed over all axes common to X and Y.
- N_fac : integer, optional
A factor by which to rescale the estimated number of degrees of freedom; the effective number will be given by the number estimated from the dataset divided by N_fac.
- output : string, optional
A string determining which parameters are returned; see list of possible outputs in the Returns section. The specifications must be separated by a comma. Defaults to ‘m,b,p’.
- pbar : progress bar, optional
A progress bar object. If nothing is provided, a progress bar will be displayed if the calculation takes sufficiently long.
- Returns
- results : Dataset
The returned variables are specified by the output argument. The names of the variables match the output request string (e.g. if ds is the returned dataset, the linear coefficient of the regression can be obtained by ds.m).
A fit of the form \(Y = m X + b + \epsilon\) is assumed, and the following parameters can be returned:
‘m’: Linear coefficient of the regression
‘b’: Constant coefficient of the regression
‘r2’: Fraction of the variance in Y explained by X (\(R^2\))
‘p’: Probability of this fit under the null hypothesis that the true linear coefficient is zero
‘sm’: Standard deviation of linear coefficient estimate (\(\hat{\sigma}_E/\sqrt{S_{XX}}\))
‘se’: Standard deviation of residuals (\(\hat{\sigma}_E\))
Notes
The statistics described are computed following von Storch and Zwiers 1999, section 8.3. The p-value ‘p’ is computed using the t-statistic given in section 8.3.8, and confidence intervals for the slope and intercept can be computed from ‘sm’ or ‘se’. The data is assumed to be normally distributed.
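To make the listed outputs concrete, here is a small pure-Python sketch of an ordinary least-squares fit \(Y = m X + b + \epsilon\) that evaluates ‘m’, ‘b’, ‘r2’, ‘se’ and ‘sm’ using the expressions given in the Returns list. It is not pygeode's implementation; the function name `simple_regress` is hypothetical, and the residual variance is assumed to use n - 2 degrees of freedom with no N_fac rescaling.

```python
import math

def simple_regress(x, y):
    """Ordinary least-squares fit of y = m*x + b for 1-D sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((v - mean_y) ** 2 for v in y)
    s_xy = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y))
    m = s_xy / s_xx                      # 'm': linear coefficient
    b = mean_y - m * mean_x              # 'b': constant coefficient
    sse = sum((v - (m * a + b)) ** 2 for a, v in zip(x, y))
    r2 = 1.0 - sse / s_yy                # 'r2': explained variance fraction
    se = math.sqrt(sse / (n - 2))        # 'se': std. dev. of residuals (assumed n - 2 dof)
    sm = se / math.sqrt(s_xx)            # 'sm': std. dev. of slope estimate
    return {"m": m, "b": b, "r2": r2, "se": se, "sm": sm}

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.9, 5.2, 7.1, 8.8]
fit = simple_regress(x, y)
t_slope = fit["m"] / fit["sm"]           # t-statistic for H0: slope is zero
print(fit, t_slope)
```

An approximate confidence interval for the slope is then m ± t_crit · sm, where t_crit is the appropriate quantile of the Student t distribution.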
- pygeode.multiple_regress(Xs, Y, axes=None, N_fac=None, output='B,p', pbar=None)
Computes least-squares multiple regression of Y against variables Xs.
- Parameters
- Xs : list of Var instances
Variables to treat as independent regressors. Must have at least one axis in common with each other and with Y.
- Y : Var
The dependent variable. Must have at least one axis in common with the Xs.
- axes : list, optional
Axes over which to compute the regression; if nothing is specified, the regression is computed over all axes common to the Xs and Y.
- N_fac : integer, optional
A factor by which to rescale the estimated number of degrees of freedom; the effective number will be given by the number estimated from the dataset divided by N_fac.
- output : string, optional
A string determining which parameters are returned; see list of possible outputs in the Returns section. The specifications must be separated by a comma. Defaults to ‘B,p’.
- pbar : progress bar, optional
A progress bar object. If nothing is provided, a progress bar will be displayed if the calculation takes sufficiently long.
- Returns
- results : tuple of floats or Var instances
The return values are specified by the output argument. The names of the variables match the output request string.
A fit of the form \(Y = \sum_i \beta_i X_i + \epsilon\) is assumed. Note that a constant term is not included by default. The following parameters can be returned:
‘B’: Linear coefficients \(\beta_i\) of each regressor
‘r2’: Fraction of the variance in Y explained by all Xs (\(R^2\))
‘p’: p-value of regression; see notes.
‘sb’: Standard deviation of each linear coefficient
‘covb’: Covariance matrix of the linear coefficients
‘se’: Standard deviation of residuals
The outputs ‘B’, ‘p’, and ‘sb’ will produce as many outputs as there are regressors.
Notes
The statistics described are computed following von Storch and Zwiers 1999, section 8.4. The p-value ‘p’ is computed using the t-statistic appropriate for the multivariate normal estimator \(\hat{\vec{a}}\) given in section 8.4.2; it corresponds to the probability of obtaining the regression coefficient under the null hypothesis that there is no linear relationship. Note this may not be the best way to determine if a given parameter is contributing a significant fraction to the explained variance of Y. The standard deviations ‘se’ and ‘sb’ are \(\hat{\sigma}_E\) and the square root of the diagonal elements of \(\hat{\sigma}^2_E (\chi^T\chi)^{-1}\) in von Storch and Zwiers, respectively. The data is assumed to be normally distributed.
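The relationship between ‘B’, ‘se’, ‘covb’ and ‘sb’ can be sketched for the two-regressor case by solving the normal equations \((\chi^T\chi)\,\vec{\beta} = \chi^T Y\) by hand. This is a hedged illustration in pure Python, not pygeode's code: the function name `fit_two_regressors` is hypothetical, no constant term is included (matching the default described above), and the residual variance is assumed to use n - k degrees of freedom with k = 2 regressors.

```python
import math

def fit_two_regressors(x1, x2, y):
    """Least-squares fit of y = B[0]*x1 + B[1]*x2 (no constant term)."""
    # Elements of the 2x2 matrix X^T X and the vector X^T y
    a11 = sum(a * a for a in x1)
    a12 = sum(a * b for a, b in zip(x1, x2))
    a22 = sum(b * b for b in x2)
    c1 = sum(a * v for a, v in zip(x1, y))
    c2 = sum(b * v for b, v in zip(x2, y))
    # Explicit inverse of the symmetric 2x2 matrix
    det = a11 * a22 - a12 * a12
    inv = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]
    B = [inv[0][0] * c1 + inv[0][1] * c2,
         inv[1][0] * c1 + inv[1][1] * c2]
    resid = [v - B[0] * a - B[1] * b for a, b, v in zip(x1, x2, y)]
    dof = len(y) - 2                      # assumed: n minus number of regressors
    se = math.sqrt(sum(e * e for e in resid) / dof)
    covb = [[se * se * inv[i][j] for j in range(2)] for i in range(2)]
    sb = [math.sqrt(covb[0][0]), math.sqrt(covb[1][1])]
    return B, se, sb, covb

# Synthetic data: y = 2*x1 + 3*x2 plus noise orthogonal to both regressors
x1 = [1.0, 0.0, 1.0, 0.0]
x2 = [0.0, 1.0, 0.0, 1.0]
y = [2.1, 3.2, 1.9, 2.8]
B, se, sb, covb = fit_two_regressors(x1, x2, y)
print(B, se, sb)
```

To include a constant term in such a fit, a regressor that is identically one can be added to the list of Xs.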
- pygeode.difference(X, Y, axes=None, alpha=0.05, Nx_fac=None, Ny_fac=None, output='d,p,ci', pbar=None)
Computes the mean value and statistics of X - Y.
- Parameters
- X, Y : Var
Variables to difference. Must have at least one axis in common.
- axes : list, optional, defaults to None
Axes over which to compute means; if nothing is specified, the mean is computed over all axes common to X and Y.
- alpha : float, optional; defaults to 0.05
Confidence level for which to compute confidence interval.
- Nx_fac : integer, optional; defaults to None
A factor by which to rescale the estimated number of degrees of freedom of X; the effective number will be given by the number estimated from the dataset divided by Nx_fac.
- Ny_fac : integer, optional; defaults to None
A factor by which to rescale the estimated number of degrees of freedom of Y; the effective number will be given by the number estimated from the dataset divided by Ny_fac.
- output : string, optional
A string determining which parameters are returned; see list of possible outputs in the Returns section. The specifications must be separated by a comma. Defaults to ‘d,p,ci’.
- pbar : progress bar, optional
A progress bar object. If nothing is provided, a progress bar will be displayed if the calculation takes sufficiently long.
- Returns
- results : Dataset
The returned variables are specified by the output argument. The names of the variables match the output request string (e.g. if ds is the returned dataset, the average of the difference can be obtained by ds.d). The following four quantities can be computed:
‘d’: The difference in the means, X - Y
‘df’: The effective number of degrees of freedom, \(df\)
‘p’: The p-value; see notes.
‘ci’: The confidence interval of the difference at the level specified by alpha
Notes
The effective number of degrees of freedom is estimated using eq (6.20) of von Storch and Zwiers 1999, in which \(n_X\) and \(n_Y\) are scaled by Nx_fac and Ny_fac, respectively. This provides a means of taking into account serial correlation in the data (see sections 6.6.7-9), but the number of effective degrees of freedom is not calculated explicitly by this routine. The p-value and confidence interval are computed based on the t-statistic in eq (6.19).
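The role of Nx_fac and Ny_fac can be illustrated with a short pure-Python sketch of an unequal-variance (Welch) t-test in which the sample sizes are divided by the scaling factors before the statistic and the degrees of freedom are evaluated. It assumes that eqs (6.19) and (6.20) take the usual Welch and Welch-Satterthwaite forms, which has not been verified against the reference; the function name `welch_difference` is hypothetical and this is not pygeode's implementation.

```python
import math

def welch_difference(x, y, nx_fac=1.0, ny_fac=1.0):
    """Difference of means with a Welch t-statistic and effective dof."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / (len(x) - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (len(y) - 1)
    # Rescaled sample sizes (assumed meaning of Nx_fac and Ny_fac)
    nx = len(x) / nx_fac
    ny = len(y) / ny_fac
    sx = var_x / nx
    sy = var_y / ny
    d = mean_x - mean_y                   # 'd': difference in the means
    t = d / math.sqrt(sx + sy)
    # Welch-Satterthwaite approximation for the degrees of freedom ('df')
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return d, t, df

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
d, t, df = welch_difference(x, y)
d2, t2, df2 = welch_difference(x, y, nx_fac=2.0, ny_fac=2.0)
print(d, t, df)
print(d2, t2, df2)
```

Scaling both sample sizes by a factor of two leaves the difference unchanged but reduces both the magnitude of the t-statistic and the degrees of freedom, making the test more conservative.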
- pygeode.paired_difference(X, Y, axes=None, alpha=0.05, N_fac=None, output='d,p,ci', pbar=None)
Computes the mean value and statistics of X - Y, assuming that individual elements of X and Y can be directly paired. In contrast to difference(), X and Y must have the same shape.
- Parameters
- X, Y : Var
Variables to difference. Must share all axes over which the means are being computed.
- axes : list, optional
Axes over which to compute means; if nothing is specified, the mean is computed over all axes common to X and Y.
- alpha : float, optional; defaults to 0.05
Confidence level for which to compute confidence interval.
- N_fac : integer, optional
A factor by which to rescale the estimated number of degrees of freedom of X and Y; the effective number will be given by the number estimated from the dataset divided by N_fac.
- output : string, optional
A string determining which parameters are returned; see list of possible outputs in the Returns section. The specifications must be separated by a comma. Defaults to ‘d,p,ci’.
- pbar : progress bar, optional
A progress bar object. If nothing is provided, a progress bar will be displayed if the calculation takes sufficiently long.
- Returns
- results : Dataset
The returned variables are specified by the output argument. The names of the variables match the output request string (e.g. if ds is the returned dataset, the average of the difference can be obtained by ds.d). The following four quantities can be computed:
‘d’: The difference in the means, X - Y
‘df’: The effective number of degrees of freedom, \(df\)
‘p’: The p-value; see notes.
‘ci’: The confidence interval of the difference at the level specified by alpha
Notes
Following section 6.6.6 of von Storch and Zwiers 1999, a one-sample t test is used to test the hypothesis. The number of degrees of freedom is the sample size scaled by N_fac, less one. This provides a means of taking into account serial correlation in the data (see sections 6.6.7-9), but the appropriate number of effective degrees of freedom is not calculated explicitly by this routine. The p-value and confidence interval are computed based on the t-statistic in eq (6.21).
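A paired test reduces to a one-sample t-test on the element-wise differences, which the following pure-Python sketch spells out. The function name `paired_t` is hypothetical; the sketch assumes that the rescaled sample size n / N_fac enters both the standard error and the degrees of freedom, which may differ in detail from what pygeode does.

```python
import math

def paired_t(x, y, n_fac=1.0):
    """One-sample t-test on the paired differences x[i] - y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length to be paired")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d = sum(diffs) / n                          # 'd': mean difference
    var_d = sum((v - d) ** 2 for v in diffs) / (n - 1)
    n_eff = n / n_fac                           # assumed meaning of N_fac
    t = d / math.sqrt(var_d / n_eff)
    df = n_eff - 1                              # 'df': scaled sample size, less one
    return d, t, df

x = [5.0, 7.0, 9.0, 6.0]
y = [4.0, 5.0, 8.0, 6.0]
d, t, df = paired_t(x, y)
print(d, t, df)
```

Because each difference is formed before averaging, variability shared by the two samples cancels, which is why pairing can give a more sensitive test than an unpaired comparison of the same data.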
- pygeode.isnonzero(X, axes=None, alpha=0.05, N_fac=None, output='m,p', pbar=None)
Computes the mean value of X and statistics relevant for a test against the hypothesis that it is 0.
- Parameters
- X : Var
Variable to average.
- axes : list, optional
Axes over which to compute the mean; if nothing is specified, the mean is computed over all axes.
- alpha : float, optional; defaults to 0.05
Confidence level for which to compute confidence interval.
- N_fac : integer, optional
A factor by which to rescale the estimated number of degrees of freedom; the effective number will be given by the number estimated from the dataset divided by N_fac.
- output : string, optional
A string determining which parameters are returned; see list of possible outputs in the Returns section. The specifications must be separated by a comma. Defaults to ‘m,p’.
- pbar : progress bar, optional
A progress bar object. If nothing is provided, a progress bar will be displayed if the calculation takes sufficiently long.
- Returns
- results : Dataset
The names of the variables match the output request string (e.g. if ds is the returned dataset, the mean value can be obtained through ds.m). The following quantities can be calculated:
‘m’: The mean value of X
‘p’: The probability of the computed value if the population mean were zero
‘ci’: The confidence interval of the mean at the level specified by alpha
If the average is taken over all axes of X resulting in a scalar, the above values are returned as a tuple in the order given. If not, the results are provided as Var objects in a dataset.
Notes
The number of effective degrees of freedom can be scaled as in difference(). The p-value and confidence interval are computed for the t-statistic defined in eq (6.61) of von Storch and Zwiers 1999.
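The test against a zero mean and the associated confidence interval can be sketched in pure Python as follows. The function name `mean_test` is hypothetical, and the critical value of the Student t distribution is passed in by hand (2.776 is the standard table value for a two-sided 95% interval with 4 degrees of freedom) rather than computed, so the sketch needs no external libraries. How pygeode represents the ‘ci’ output is not specified above; returning the lower and upper bounds here is an assumption made for illustration.

```python
import math

def mean_test(x, t_crit, n_fac=1.0):
    """Mean of x, its t-statistic against zero, and a confidence interval."""
    n = len(x)
    m = sum(x) / n                              # 'm': mean value
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    n_eff = n / n_fac                           # assumed meaning of N_fac
    std_err = math.sqrt(var / n_eff)
    t = m / std_err                             # t-statistic for H0: mean is zero
    df = n_eff - 1
    half_width = t_crit * std_err               # t_crit must match df and alpha
    return m, t, df, (m - half_width, m + half_width)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
# 2.776: tabulated two-sided 95% critical value for 4 degrees of freedom
m, t, df, ci = mean_test(x, t_crit=2.776)
print(m, t, df, ci)
```

In this example the interval does not contain zero, which is consistent with rejecting the zero-mean hypothesis at the 5% level.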