NAG Library Function Document

1 Purpose

nag_simple_linear_regression (g02cac) performs a simple linear regression with or without a constant term. The data is optionally weighted.

2 Specification

 #include <nag.h>
 #include <nagg02.h>
 void nag_simple_linear_regression (Nag_SumSquare mean, Integer n, const double x[], const double y[], const double wt[], double *a, double *b, double *a_serr, double *b_serr, double *rsq, double *rss, double *df, NagError *fail)

3 Description

nag_simple_linear_regression (g02cac) fits a straight line model of the form,
 $E\left(y\right)=a+bx,$
where $E\left(y\right)$ is the expected value of the variable $y$, to the data points
 $\left({x}_{1},{y}_{1}\right),\left({x}_{2},{y}_{2}\right),\dots ,\left({x}_{n},{y}_{n}\right),$
such that
 ${y}_{i}=a+b{x}_{i}+{e}_{i},\quad i=1,2,\dots ,n\quad \left(n>2\right),$
where the ${e}_{i}$ values are independent random errors. The $i$th data point may have an associated weight ${w}_{i}$. Weights may be used when var $\left({e}_{i}\right)={\sigma }^{2}/{w}_{i}$, when observations are to be removed from the regression by giving them zero weight, or when the $i$th data point has been observed with frequency ${w}_{i}$.
The regression coefficient, $b$, and the regression constant, $a$, are estimated by minimizing
 $\sum _{i=1}^{n}{w}_{i}{e}_{i}^{2},$
where ${w}_{i}=1.0$ if the weights option is not selected.
The following statistics are computed:
• the estimate of regression constant $\stackrel{^}{a}=\stackrel{-}{y}-\stackrel{^}{b}\stackrel{-}{x}$,
• the estimate of regression coefficient $\stackrel{^}{b}=\frac{\sum {w}_{i}\left({x}_{i}-\stackrel{-}{x}\right)\left({y}_{i}-\stackrel{-}{y}\right)}{\sum {w}_{i}{\left({x}_{i}-\stackrel{-}{x}\right)}^{2}}$,
• the residual sum of squares $rss=\sum {w}_{i}{\left({y}_{i}-{\stackrel{^}{y}}_{i}\right)}^{2}$,
where the weighted means $\stackrel{-}{x}$ and $\stackrel{-}{y}$ are
 $\stackrel{-}{x}=\frac{\sum {w}_{i}{x}_{i}}{\sum {w}_{i}}\quad \text{and}\quad \stackrel{-}{y}=\frac{\sum {w}_{i}{y}_{i}}{\sum {w}_{i}}.$
The number of degrees of freedom associated with $rss$ is
• $df=\sum {w}_{i}-2$ where ${\mathbf{mean}}=\mathrm{Nag_AboutMean}$
• $df=\sum {w}_{i}-1$ where ${\mathbf{mean}}=\mathrm{Nag_AboutZero}$
Note: the weights should be scaled to give the correct degrees of freedom in the case var $\left({e}_{i}\right)={\sigma }^{2}/{w}_{i}$.
The ${R}^{2}$ value or coefficient of determination
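 ${R}^{2}=\frac{\sum {w}_{i}{\left({\stackrel{^}{y}}_{i}-\stackrel{-}{y}\right)}^{2}}{\sum {w}_{i}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}}=\frac{\sum {w}_{i}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}-rss}{\sum {w}_{i}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}}.$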
This measures the proportion of the total variation about the mean $\stackrel{-}{y}$ that can be explained by the regression.
The standard error for the regression constant $\stackrel{^}{a}$
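 $a\_serr=\sqrt{\frac{rss}{df}}\sqrt{\frac{1}{\sum {w}_{i}}+\frac{{\stackrel{-}{x}}^{2}}{\sum {w}_{i}{\left({x}_{i}-\stackrel{-}{x}\right)}^{2}}}=\sqrt{\frac{rss}{df}}\sqrt{\frac{1}{\sum {w}_{i}}\,\frac{\sum {w}_{i}{x}_{i}^{2}}{\sum {w}_{i}{\left({x}_{i}-\stackrel{-}{x}\right)}^{2}}}.$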
The standard error for the regression coefficient $\stackrel{^}{b}$
 $b\_serr=\frac{\sqrt{\frac{rss}{df}}}{\sqrt{\sum {w}_{i}{\left({x}_{i}-\stackrel{-}{x}\right)}^{2}}}.$
Similar formulae can be derived for the case when the line goes through the origin, that is $a=0$.
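As a sketch of that derivation (worked from the same weighted least squares criterion, not quoted from the NAG documentation): setting $a=0$ and minimizing $\sum {w}_{i}{\left({y}_{i}-b{x}_{i}\right)}^{2}$ with respect to $b$ gives
 $\stackrel{^}{b}=\frac{\sum {w}_{i}{x}_{i}{y}_{i}}{\sum {w}_{i}{x}_{i}^{2}},\quad rss=\sum {w}_{i}{\left({y}_{i}-\stackrel{^}{b}{x}_{i}\right)}^{2},\quad b\_serr=\frac{\sqrt{\frac{rss}{df}}}{\sqrt{\sum {w}_{i}{x}_{i}^{2}}},$
with $df=\sum {w}_{i}-1$ as given above for ${\mathbf{mean}}=\mathrm{Nag_AboutZero}$.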

4 References

Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley

5 Arguments

1:    $\mathbf{mean}$    Nag_SumSquare    Input
On entry: indicates whether nag_simple_linear_regression (g02cac) is to include a constant term in the regression.
${\mathbf{mean}}=\mathrm{Nag_AboutMean}$
The regression constant $a$ is included.
${\mathbf{mean}}=\mathrm{Nag_AboutZero}$
The regression constant $a$ is not included, i.e., $a=0$.
Constraint: ${\mathbf{mean}}=\mathrm{Nag_AboutMean}$ or $\mathrm{Nag_AboutZero}$.
2:    $\mathbf{n}$    Integer    Input
On entry: $n$, the number of observations.
Constraints:
• if ${\mathbf{mean}}=\mathrm{Nag_AboutMean}$, ${\mathbf{n}}\ge 2$;
• if ${\mathbf{mean}}=\mathrm{Nag_AboutZero}$, ${\mathbf{n}}\ge 1$.
3:    $\mathbf{x}\left[{\mathbf{n}}\right]$    const double    Input
On entry: the values of the independent variable with the $\mathit{i}$th value stored in ${\mathbf{x}}\left[\mathit{i}-1\right]$, for $\mathit{i}=1,2,\dots ,n$.
Constraint: the values of ${\mathbf{x}}$ must not all be identical.
4:    $\mathbf{y}\left[{\mathbf{n}}\right]$    const double    Input
On entry: the values of the dependent variable with the $\mathit{i}$th value stored in ${\mathbf{y}}\left[\mathit{i}-1\right]$, for $\mathit{i}=1,2,\dots ,n$.
Constraint: the values of ${\mathbf{y}}$ must not all be identical.
5:    $\mathbf{wt}\left[{\mathbf{n}}\right]$    const double    Input
On entry: if weighted estimates are required then wt must contain the weights to be used in the weighted regression. Usually ${\mathbf{wt}}\left[i-1\right]$ will be an integral value corresponding to the number of observations associated with the $i$th data point, or zero if the $i$th data point is to be ignored. The sum of the weights therefore represents the effective total number of observations used to create the regression line.
If weights are not provided then wt must be set to NULL and the effective number of observations is n.
Constraint: if ${\mathbf{wt}}\phantom{\rule{0.25em}{0ex}}\text{is not}\phantom{\rule{0.25em}{0ex}}\mathbf{NULL}$, ${\mathbf{wt}}\left[\mathit{i}-1\right]\ge 0.0$, for $\mathit{i}=1,2,\dots ,n$.
6:    $\mathbf{a}$    double *    Output
On exit: if ${\mathbf{mean}}=\mathrm{Nag_AboutMean}$ then a is the regression constant $\stackrel{^}{a}$, otherwise a is set to zero.
7:    $\mathbf{b}$    double *    Output
On exit: the regression coefficient $\stackrel{^}{b}$.
8:    $\mathbf{a_serr}$    double *    Output
On exit: the standard error of the regression constant $\stackrel{^}{a}$.
9:    $\mathbf{b_serr}$    double *    Output
On exit: the standard error of the regression coefficient $\stackrel{^}{b}$.
10:  $\mathbf{rsq}$    double *    Output
On exit: the coefficient of determination, ${R}^{2}$.
11:  $\mathbf{rss}$    double *    Output
On exit: the sum of squares of the residuals about the regression.
12:  $\mathbf{df}$    double *    Output
On exit: the degrees of freedom associated with the residual sum of squares.
13:  $\mathbf{fail}$    NagError *    Input/Output
The NAG error argument (see Section 3.7 in How to Use the NAG Library and its Documentation).

6 Error Indicators and Warnings

On entry, argument mean had an illegal value.
NE_INT_ARG_LT
On entry, ${\mathbf{n}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{n}}\ge 1$ if ${\mathbf{mean}}=\mathrm{Nag_AboutZero}$.
On entry, ${\mathbf{n}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{n}}\ge 2$ if ${\mathbf{mean}}=\mathrm{Nag_AboutMean}$.
NE_NEG_WEIGHT
On entry, at least one of the weights is negative.
NE_SW_LOW
On entry, the sum of elements of wt must be greater than 1.0 if ${\mathbf{mean}}=\mathrm{Nag_AboutZero}$ or greater than 2.0 if ${\mathbf{mean}}=\mathrm{Nag_AboutMean}$.
NE_WT_LOW
On entry, wt must contain at least 1 positive element if ${\mathbf{mean}}=\mathrm{Nag_AboutZero}$ or at least 2 positive elements if ${\mathbf{mean}}=\mathrm{Nag_AboutMean}$.
NE_X_OR_Y_IDEN
On entry, all elements of x and/or y are equal.
NE_ZERO_DOF_RESID
On entry, the degrees of freedom for the residual are zero, i.e., the designated number of arguments $=$ the effective number of observations.
Residual sum of squares is zero, i.e., a perfect fit was obtained.
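The input conditions listed in this section can be checked before calling the function. The following is a minimal sketch of such a pre-check, assuming hypothetical names throughout: `FitMode` stands in for Nag_SumSquare, and the `CHK_*` codes are invented for illustration and are not NAG error codes. The order in which NAG itself tests these conditions is not stated in this document, so the order below is an assumption, as is the choice to skip the "all identical" test when there is only one observation.

```c
#include <stddef.h>

/* Hypothetical stand-in for Nag_SumSquare. */
typedef enum { ABOUT_MEAN, ABOUT_ZERO } FitMode;

/* Hypothetical status codes for this sketch; not NAG error codes. */
typedef enum {
    CHK_OK,
    CHK_N_TOO_SMALL,
    CHK_NEG_WEIGHT,
    CHK_TOO_FEW_POSITIVE,
    CHK_SUM_WEIGHTS_LOW,
    CHK_X_OR_Y_IDENTICAL
} CheckStatus;

/* Returns 1 when n > 1 and every element of v equals v[0]. */
static int all_equal(size_t n, const double v[])
{
    if (n < 2) return 0;   /* assumption: a single value is not "identical" */
    for (size_t i = 1; i < n; i++)
        if (v[i] != v[0]) return 0;
    return 1;
}

/* Mirrors the documented constraints on n, wt, x and y.
   wt may be NULL, meaning unit weights. */
CheckStatus check_inputs(FitMode mode, size_t n, const double x[],
                         const double y[], const double wt[])
{
    /* Two fitted parameters with a constant term, one without. */
    size_t params = (mode == ABOUT_MEAN) ? 2 : 1;
    if (n < params) return CHK_N_TOO_SMALL;

    if (wt != NULL) {
        size_t positive = 0;
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (wt[i] < 0.0) return CHK_NEG_WEIGHT;
            if (wt[i] > 0.0) positive++;
            sum += wt[i];
        }
        if (positive < params) return CHK_TOO_FEW_POSITIVE;
        /* The sum of weights must exceed 2.0 (or 1.0 without a constant). */
        if (sum <= (double)params) return CHK_SUM_WEIGHTS_LOW;
    }

    if (all_equal(n, x) || all_equal(n, y)) return CHK_X_OR_Y_IDENTICAL;
    return CHK_OK;
}
```

A caller might run `check_inputs` first and only invoke the regression when it returns `CHK_OK`. Conditions that depend on the fit itself, such as zero residual degrees of freedom, are outside the scope of this pre-check.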

7 Accuracy

The computations are believed to be stable.

8 Parallelism and Performance

nag_simple_linear_regression (g02cac) is not threaded in any implementation.

9 Further Comments

The time taken by the function depends on $n$. The function uses a two-pass algorithm.
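To illustrate what a two-pass computation looks like, the sketch below forms a weighted sum of squares about the weighted mean in two passes and, for contrast, in the one-pass "textbook" form. Both function names are hypothetical and neither is taken from NAG's source. A plausible motivation for the two-pass form is that it avoids subtracting two large, nearly equal quantities, although this document does not state the reason.

```c
#include <stddef.h>

/* Two-pass weighted sum of squares about the weighted mean.
   Pass 1 accumulates the mean; pass 2 accumulates squared deviations.
   Pass wt == NULL for unit weights. Assumes the weights sum to a
   positive value. */
double centred_ss_two_pass(size_t n, const double v[], const double wt[])
{
    double sw = 0.0, swv = 0.0;
    for (size_t i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        sw += w;
        swv += w * v[i];
    }
    double mean = swv / sw;

    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        double d = v[i] - mean;
        ss += w * d * d;
    }
    return ss;
}

/* One-pass form, sum(w v^2) - (sum(w v))^2 / sum(w), shown for contrast
   only. It can lose accuracy when the values are large relative to
   their spread. */
double centred_ss_one_pass(size_t n, const double v[], const double wt[])
{
    double sw = 0.0, swv = 0.0, swvv = 0.0;
    for (size_t i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        sw += w;
        swv += w * v[i];
        swvv += w * v[i] * v[i];
    }
    return swvv - swv * swv / sw;
}
```

For values such as 1000000001, 1000000002 and 1000000003 the two-pass form works with the deviations -1, 0 and 1 directly, whereas the one-pass form must difference two sums of order $10^{18}$, where double precision cannot resolve a result of order 1.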

10 Example

A program to calculate regression constants, $\stackrel{^}{a}$ and $\stackrel{^}{b}$, the standard error of the regression constants, the regression coefficient of determination and the degrees of freedom about the regression.

10.1 Program Text

Program Text (g02cace.c)

10.2 Program Data

Program Data (g02cace.d)

10.3 Program Results

Program Results (g02cace.r)

© The Numerical Algorithms Group Ltd, Oxford, UK. 2017