2.64.1.2 Algorithm for Identify Data Distribution

In this app, the MLE (maximum likelihood estimation) method is used to estimate parameters for each distribution except the Gaussian Mixture distribution, which is fitted with the EM (expectation-maximization) algorithm.

In the MLE method, for input data \mathbf{y} and parameter vector \theta:

  1. First calculate the likelihood function \mathcal{L}_{n}( \theta \,;\mathbf {y} ) for the distribution, which is the product of the probability density functions evaluated at the input data.
  2. Then maximize the logarithm of the likelihood function \ell ( \theta \,;\mathbf {y} ) by setting its partial derivatives with respect to the parameters equal to zero.
  3. Solve the equations from step 2 for the parameters with the Newton-Raphson method (a sketch is given below).
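As a concrete illustration of these steps, here is a minimal Python sketch for the Weibull distribution; the profile score equation for the shape \beta and the closed-form scale \alpha are standard results restated for illustration, not the app's actual code.

    import numpy as np

    def weibull_mle(x, beta0=1.0, tol=1e-10, max_iter=100):
        """MLE for the Weibull distribution (scale alpha, shape beta).

        Setting the partial derivatives of the log-likelihood to zero and
        eliminating alpha leaves one score equation in beta,
            g(beta) = 1/beta + mean(ln x) - sum(x^beta ln x) / sum(x^beta) = 0,
        which is solved by Newton-Raphson iteration (step 3 above).
        """
        x = np.asarray(x, dtype=float)
        lx = np.log(x)
        beta = beta0
        for _ in range(max_iter):
            xb = x ** beta
            A, B, C = xb.sum(), (xb * lx).sum(), (xb * lx ** 2).sum()
            g = 1.0 / beta + lx.mean() - B / A            # score g(beta)
            dg = -1.0 / beta ** 2 - C / A + (B / A) ** 2  # derivative g'(beta)
            step = g / dg
            beta -= step
            if abs(step) < tol:
                break
        alpha = ((x ** beta).sum() / len(x)) ** (1.0 / beta)  # scale given beta
        return alpha, beta

    # Example: recover parameters from simulated Weibull(alpha=2, beta=1.5) data
    rng = np.random.default_rng(0)
    print(weibull_mle(2.0 * rng.weibull(1.5, size=2000)))  # approx. (2.0, 1.5)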

Distributions

In this app, the following distributions are supported. For each distribution, the parameters, probability density function (PDF), cumulative distribution function (CDF) and inverse cumulative distribution function (inverse CDF) are listed below.

Normal
  Parameters: location \mu, scale \sigma
  PDF: f(x)=\frac{1}{ \sqrt{2 \pi} \sigma } e^{ -\frac{( x - \mu )^2}{2 \sigma^2} }, \; \sigma > 0
  CDF: P(x) = \Phi \left( \frac{x - \mu}{\sigma} \right)
  Inverse CDF: x = \mu + \sigma \Phi^{-1}(p)

Lognormal
  Parameters: location \mu, scale \sigma
  PDF: f(x)=\frac{1}{ \sqrt{2 \pi} \sigma x } e^{ -\frac{( \text{ln}(x) - \mu )^2}{2 \sigma^2} }, \; x > 0, \; \sigma > 0
  CDF: P(x) = \Phi \left( \frac{\text{ln}(x) - \mu}{\sigma} \right)
  Inverse CDF: \text{ln}(x) = \mu + \sigma \Phi^{-1}(p)

3-Parameter Lognormal
  Parameters: location \mu, scale \sigma, threshold \lambda
  PDF: f(x)=\frac{1}{\sqrt{2 \pi} \sigma (x - \lambda)} e^{ -\frac{( \text{ln}(x - \lambda) - \mu )^2}{2 \sigma^2} }, \; x > \lambda, \; \sigma > 0
  CDF: P(x) = \Phi \left( \frac{\text{ln}(x - \lambda) - \mu}{\sigma} \right)
  Inverse CDF: \text{ln}(x - \lambda) = \mu + \sigma \Phi^{-1}(p)

Gamma
  Parameters: shape \alpha, scale \beta
  PDF: f(x) = \frac{ x^{\alpha-1} e^{-\frac{x}{\beta}} }{ \beta^{\alpha} \Gamma(\alpha) }, \; x > 0, \; \alpha > 0, \; \beta > 0
  CDF: P(x) = \frac{1}{ \Gamma(\alpha) } \gamma\left( \alpha, \frac{x}{\beta} \right)
  Inverse CDF: x = \beta \, \gamma^{-1}( \alpha, \Gamma(\alpha) p )

Exponential
  Parameters: scale \theta
  PDF: f(x) = \frac{1}{\theta} e^{-\frac{x}{\theta}}, \; x > 0, \; \theta > 0
  CDF: P(x) = 1 - e^{ -\frac{x}{\theta} }
  Inverse CDF: x = \theta ( -\text{ln}(1-p) )

2-Parameter Exponential
  Parameters: scale \theta, threshold \lambda
  PDF: f(x) = \frac{1}{\theta} e^{-\frac{x - \lambda}{\theta}}, \; x > \lambda, \; \theta > 0
  CDF: P(x) = 1 - e^{ -\frac{x - \lambda}{\theta} }
  Inverse CDF: x = \lambda + \theta ( -\text{ln}(1-p) )

Smallest Extreme Value
  Parameters: location \alpha, scale \beta
  PDF: f(x) = \frac{1}{\beta} e^{ \frac{x - \alpha}{\beta} } e^{ -e^{ \frac{x - \alpha}{\beta} } }, \; \beta > 0
  CDF: P(x) = 1 - e^{ -e^{ \frac{x - \alpha}{\beta} } }
  Inverse CDF: x = \alpha + \beta \, \text{ln}( -\text{ln}(1-p) )

Weibull
  Parameters: scale \alpha, shape \beta
  PDF: f(x) = \frac{\beta}{\alpha} \left( \frac{x}{\alpha} \right)^{\beta - 1} e^{ -\left( \frac{x}{\alpha} \right)^{\beta} }, \; x > 0, \; \alpha > 0, \; \beta > 0
  CDF: P(x) = 1 - e^{ -\left( \frac{x}{\alpha} \right)^{\beta} }
  Inverse CDF: \text{ln}(x) = \text{ln}(\alpha) + \frac{1}{\beta} \text{ln}( -\text{ln}(1-p) )

3-Parameter Weibull
  Parameters: scale \alpha, shape \beta, threshold \lambda
  PDF: f(x) = \frac{\beta}{\alpha} \left( \frac{x - \lambda}{\alpha} \right)^{\beta - 1} e^{ -\left( \frac{x - \lambda}{\alpha} \right)^{\beta} }, \; x > \lambda, \; \alpha > 0, \; \beta > 0
  CDF: P(x) = 1 - e^{ -\left( \frac{x - \lambda}{\alpha} \right)^{\beta} }
  Inverse CDF: \text{ln}(x - \lambda) = \text{ln}(\alpha) + \frac{1}{\beta} \text{ln}( -\text{ln}(1-p) )

Largest Extreme Value
  Parameters: location \alpha, scale \beta
  PDF: f(x) = \frac{1}{\beta} e^{ -\frac{x - \alpha}{\beta} } e^{ -e^{ -\frac{x - \alpha}{\beta} } }, \; \beta > 0
  CDF: P(x) = e^{ -e^{ -\frac{x - \alpha}{\beta} } }
  Inverse CDF: x = \alpha - \beta \, \text{ln}( -\text{ln}(p) )

Logistic
  Parameters: location \mu, scale \sigma
  PDF: f(x) = \frac{ e^{ \frac{x - \mu}{\sigma} } }{ \sigma \left( 1 + e^{ \frac{x - \mu}{\sigma} } \right)^2 }, \; \sigma > 0
  CDF: P(x) = \frac{1}{ 1 + e^{ -\frac{x - \mu}{\sigma} } }
  Inverse CDF: x = \mu + \sigma \, \text{ln}\left( \frac{p}{1-p} \right)

Loglogistic
  Parameters: location \mu, scale \sigma
  PDF: f(x) = \frac{ e^{ \frac{\text{ln}(x) - \mu}{\sigma} } }{ x \sigma \left( 1 + e^{ \frac{\text{ln}(x) - \mu}{\sigma} } \right)^2 }, \; x > 0, \; \sigma > 0
  CDF: P(x) = \frac{1}{ 1 + e^{ -\frac{\text{ln}(x) - \mu}{\sigma} } }
  Inverse CDF: x = e^{ \mu + \sigma \, \text{ln}\left( \frac{p}{1-p} \right) }

Folded Normal
  Parameters: location \mu, scale \sigma
  PDF: f(x) = \frac{1}{ \sqrt{2 \pi} \sigma } e^{ -\frac{(x - \mu)^2}{2 \sigma^2} } + \frac{1}{ \sqrt{2 \pi} \sigma } e^{ -\frac{(x + \mu)^2}{2 \sigma^2} }, \; x \geqslant 0, \; \sigma > 0
  CDF: P(x) = \Phi \left( \frac{x + \mu}{\sigma} \right) - \Phi \left( \frac{\mu - x}{\sigma} \right)
  Inverse CDF: x = \text{foldnorminv}(p, \, \mu, \, \sigma)

Rayleigh
  Parameters: scale \sigma
  PDF: f(x) = \frac{x}{\sigma^2} e^{ -\frac{x^2}{2 \sigma^2} }, \; x \geqslant 0, \; \sigma > 0
  CDF: P(x) = 1 - e^{ -\frac{x^2}{2 \sigma^2} }
  Inverse CDF: x = \sqrt{2} \sigma \sqrt{ -\text{ln}(1-p) }

Gaussian Mixture
  Parameters: number of components g; for the ith component, weight w_i, location \mu_i, scale \sigma_i
  PDF: f(x) = \sum_{i=1}^g \frac{w_i}{ \sqrt{2 \pi} \sigma_i } e^{ -\frac{( x - \mu_i )^2}{2 \sigma_i^2} }, \; 0 < w_i < 1, \; \sum_{i=1}^g w_i = 1, \; \sigma_i > 0
  CDF: P(x) = \sum_{i=1}^g w_i \Phi \left( \frac{x - \mu_i}{\sigma_i} \right)
  Inverse CDF: no explicit solution; it is calculated by interpolation.

where Φ is the CDF of the standard normal distribution, and γ is the lower incomplete gamma function.
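The closed forms above can be checked numerically. For instance, the following Python snippet verifies the Lognormal and Weibull inverse CDFs, assuming SciPy's parameterizations (mapped to the table's in the comments):

    import numpy as np
    from scipy import stats

    p, mu, sigma = 0.9, 1.0, 0.5

    # Lognormal: the table's (mu, sigma) map to lognorm(s=sigma, scale=exp(mu))
    x = np.exp(mu + sigma * stats.norm.ppf(p))              # inverse CDF above
    print(stats.lognorm.cdf(x, s=sigma, scale=np.exp(mu)))  # 0.9

    # Weibull: scale alpha and shape beta map to weibull_min(c=beta, scale=alpha)
    alpha, beta = 2.0, 1.5
    x = alpha * (-np.log(1 - p)) ** (1 / beta)              # inverse CDF above
    print(stats.weibull_min.cdf(x, c=beta, scale=alpha))    # 0.9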

Note:

  • The Box-Cox Transformation, Johnson Transformation and Yeo-Johnson Transformation distributions transform the data first and then fit a Normal distribution. For more, see Algorithm for Data Transformation.
  • The scale parameter in the Normal and Lognormal distributions is corrected to the unbiased standard deviation, i.e. the divisor n-1 is used instead of n, where n is the number of points.
  • The threshold parameter λ in the 2-Parameter Exponential distribution is estimated as follows:

\lambda = X_{(1)} - \frac{\bar{X} - X_{(1)}}{n-1} ,

where X_{(1)} is the minimum of X, and \bar{X} is the mean of X. A sketch of this estimate is given below.
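A minimal Python sketch of this fit; note the scale estimate \theta = \bar{X} - \lambda is the standard MLE for the shifted data and is an assumption here, since the note above only specifies \lambda:

    import numpy as np

    def two_param_exp_fit(x):
        """2-Parameter Exponential: bias-corrected threshold and scale."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        lam = x.min() - (x.mean() - x.min()) / (n - 1)  # threshold, formula above
        theta = x.mean() - lam                          # scale (assumed MLE of shifted data)
        return lam, theta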

Goodness of Fit

Anderson-Darling Test

The hypotheses for Anderson-Darling test are:

H0: Input data follows the distribution.
H1: Input data doesn't follow the distribution.
  • Anderson-Darling Statistic
The input data is sorted first, and the CDF is used to calculate the probability Z_i for each point. The Anderson-Darling statistic A^2 is calculated as below:

 A^2 = -n - \frac{1}{n} \sum_{i=1}^n \left[ (2i-1) \text{ln} (Z_i) + ( 2n+1-2i ) \text{ln} (1 - Z_i) \right]

The A^2 value can be used to assess the fit: the smaller the value, the better the model. A computational sketch follows this list.
  • P-value for Anderson-Darling Test
The P-value is calculated by interpolating the statistic A^2 in the tables of references [1] and [2]. If the P-value is less than the significance level, e.g. 0.05, H0 is rejected.
For some distributions no reference table is available, so their P-values are missing, e.g. for the 3-Parameter Lognormal distribution.
For the Folded Normal, Gaussian Mixture and 3-Parameter Weibull with fixed shape distributions, the P-value is calculated by the Monte Carlo method: 1000 samples are drawn from the null-hypothesized distribution with the estimated parameters, and the statistic is computed for each sample. The P-value is the proportion of samples whose A^2 statistics are greater than or equal to the input data's [4], i.e.

\text{P-value} = \frac{b + 1}{N + 1} ,

where b is the number of sample statistics that are greater than or equal to the input data's, and N is the number of samples; here N = 1000.
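The following Python sketch implements both the A^2 statistic and the Monte Carlo P-value. A Normal fit is used only to keep the example self-contained; the app applies the Monte Carlo scheme to the Folded Normal, Gaussian Mixture and fixed-shape 3-Parameter Weibull fits.

    import numpy as np
    from scipy import stats

    def anderson_darling(x, cdf):
        """Anderson-Darling statistic A^2 for a fitted CDF."""
        z = np.sort(cdf(np.asarray(x, dtype=float)))
        n = len(z)
        i = np.arange(1, n + 1)
        return -n - np.mean((2 * i - 1) * np.log(z)
                            + (2 * n + 1 - 2 * i) * np.log(1 - z))

    def mc_pvalue(x, N=1000, seed=0):
        """Parametric-bootstrap P-value, (b + 1) / (N + 1)."""
        rng = np.random.default_rng(seed)
        mu, s = x.mean(), x.std(ddof=1)
        a2_obs = anderson_darling(x, lambda v: stats.norm.cdf(v, mu, s))
        b = 0
        for _ in range(N):
            y = rng.normal(mu, s, size=len(x))  # draw from the fitted model
            m, t = y.mean(), y.std(ddof=1)      # re-fit each sample
            b += anderson_darling(y, lambda v: stats.norm.cdf(v, m, t)) >= a2_obs
        return (b + 1) / (N + 1)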

BIC

For the Gaussian Mixture distribution, the BIC value is calculated as follows:

\text{BIC} = -2 \, \ell + k \text{ln}(n) ,

where \ell is the logarithm of the likelihood from the estimation, k is the number of parameters, and n is the number of points.

The smaller the BIC value, the better the model.
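For example, a sketch of the BIC computation in Python. The parameter count k = 3g - 1 (g - 1 free weights, since they sum to 1, plus g locations and g scales) is a standard count assumed here; the text above does not state k explicitly.

    import numpy as np

    def gmm_bic(loglik, g, n):
        """BIC for a g-component Gaussian Mixture fitted to n points."""
        k = 3 * g - 1              # assumed: (g - 1) weights + g means + g scales
        return -2.0 * loglik + k * np.log(n)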

Likelihood-ratio Test (LRT)

The likelihood-ratio test (LRT) compares two models: a simple model A and a complex model B, where model A is a special case (subset) of model B, e.g. model A is Weibull and model B is 3-Parameter Weibull. It determines whether the complex model B is significantly better than the simple model A. In this app, the likelihood-ratio test supports comparing Lognormal vs 3-Parameter Lognormal, Exponential vs 2-Parameter Exponential, and Weibull vs 3-Parameter Weibull.

The hypotheses for the likelihood-ratio test are:

H0: The complex model B is the same as the simple model A.
H1: The complex model B is significantly better than the simple model A.

If the P-value for the likelihood-ratio test is less than the significance level, e.g. 0.05, H0 is rejected, indicating that the complex model B is significantly better than the simple model A.

  • Likelihood-ratio test statistic

\text{LRT} = -2 \, ( \ell(A) - \ell(B) ) ,

where \ell(A) and \ell(B) are the log-likelihoods of model A and model B, respectively.
  • Likelihood-ratio test P-value

The likelihood-ratio test statistic follows a Chi-squared distribution with df degrees of freedom, where df is the difference in the number of parameters between model B and model A. Each supported comparison adds a single threshold parameter, so df = 1 and the P-value can be calculated as follows:

\text{P-value} = \text{chi2cdf}( \text{LRT}, 1, 1) ,

where chi2cdf is the CDF function for the \chi^2 distribution, and the third argument 1 selects the upper tail.
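A minimal Python sketch of the test; the log-likelihood values in the example are hypothetical:

    from scipy import stats

    def lrt_pvalue(ll_A, ll_B, df=1):
        """Likelihood-ratio test of simple model A nested in model B.

        ll_A and ll_B are the maximized log-likelihoods; df = 1 for all
        pairs supported by the app.
        """
        lrt = -2.0 * (ll_A - ll_B)
        return lrt, stats.chi2.sf(lrt, df)  # sf is the upper-tail CDF

    # Example: Weibull vs 3-Parameter Weibull (hypothetical log-likelihoods)
    print(lrt_pvalue(-512.3, -508.1))  # LRT = 8.4, P-value ~ 0.004 < 0.05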

Estimate Parameter's Standard Error

Once the parameters are estimated by the MLE method, the Fisher information matrix (FIM) can be calculated from the Hessian matrix of the log density:

 [\mathcal{I}(\theta)]_{i,j} = - E\left[ \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \text{ln}\left( f(Y; \theta) \right) \, \Big| \, \theta \right]

The covariance matrix of the parameters can be expressed as:

C = (n \mathcal{I} )^{-1}

where n is the number of points.

A parameter's standard error is the square root of the corresponding diagonal element of the covariance matrix C.
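A numerical sketch in Python: the Hessian of the full-sample negative log-likelihood at the MLE approximates n \mathcal{I}(\theta), so inverting it gives C directly. Central finite differences stand in here for the expectation above; this is a common numerical approach, not necessarily the app's.

    import numpy as np

    def mle_standard_errors(neg_loglik, theta_hat, h=1e-4):
        """Standard errors from the inverse Hessian at the MLE."""
        k = len(theta_hat)
        H = np.empty((k, k))
        for i in range(k):
            for j in range(k):
                ei = np.zeros(k); ei[i] = h
                ej = np.zeros(k); ej[j] = h
                H[i, j] = (neg_loglik(theta_hat + ei + ej)
                           - neg_loglik(theta_hat + ei - ej)
                           - neg_loglik(theta_hat - ei + ej)
                           + neg_loglik(theta_hat - ei - ej)) / (4 * h * h)
        C = np.linalg.inv(H)            # covariance matrix of the parameters
        return np.sqrt(np.diag(C))

    # Example: Normal(mu, sigma); expect SEs near sigma/sqrt(n) and sigma/sqrt(2n)
    x = np.random.default_rng(2).normal(3.0, 2.0, size=1000)
    nll = lambda t: 0.5 * np.sum(((x - t[0]) / t[1]) ** 2) + len(x) * np.log(t[1])
    print(mle_standard_errors(nll, np.array([x.mean(), x.std()])))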

Probability Plot

Points in the probability plot are sorted in ascending order. For the ith point, its probability is calculated by the median rank (Benard's method):

P = \frac{i - 0.3}{n + 0.4}


The middle line shows the expected percentiles for the given probabilities, obtained from the inverse cumulative distribution function with the parameters from the MLE method.
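A short Python sketch of the plotting positions and the middle line, using a Normal fit as an example:

    import numpy as np
    from scipy import stats

    def median_rank_points(x):
        """Benard's median-rank probabilities for sorted data."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        p = (np.arange(1, n + 1) - 0.3) / (n + 0.4)
        return x, p

    # Middle line: expected percentiles at the same probabilities,
    # from the inverse CDF with the MLE parameters
    x, p = median_rank_points(stats.norm.rvs(1.0, 0.5, size=50, random_state=3))
    middle = stats.norm.ppf(p, loc=x.mean(), scale=x.std(ddof=1))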

To estimate the confidence limits of percentiles, the variance of a percentile is calculated by propagation of error (a numerical sketch follows the list below):

\sigma^2_{x_p} = \left( \frac{\partial x_p}{\partial \theta} \right)^T C \left( \frac{\partial x_p}{\partial \theta} \right)

where x_p is the percentile for a given probability p, and (\cdot)^T denotes the transpose of a vector.

  • Confidence limits of percentiles can be calculated as below:

x_{pL} = x_p - Z_{\alpha} \sigma_{x_p}

x_{pU} = x_p + Z_{\alpha} \sigma_{x_p}

where Z_{\alpha} = \Phi^{-1}(1- \alpha / 2) .

  • For some positive random variables, e.g. in Lognormal, Gamma, Exponential, Weibull, Loglogistic, Folded Normal and Rayleigh distributions, confidence limits are:

x_{pL} = e^{ \text{ln}(x_p) - Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

x_{pU} = e^{ \text{ln}(x_p) + Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

  • For some random variables with a threshold parameter λ, e.g. in 3-Parameter Lognormal, 2-Parameter Exponential and 3-Parameter Weibull distributions, confidence limits are:
If \lambda < 0 ,

x_{pL} = x_p - Z_{\alpha} \sigma_{x_p}

x_{pU} = x_p + Z_{\alpha} \sigma_{x_p}

If \lambda \ge 0 ,

x_{pL} = e^{ \text{ln}(x_p) - Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

x_{pU} = e^{ \text{ln}(x_p) + Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

If  x_{pL} < \lambda , the x_{pL} value will be corrected to \lambda .
  • A Q-Q plot is shown for the Gaussian Mixture distribution. Its expected percentiles are calculated by interpolation.
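A Python sketch of the normal-approximation limits above; the gradient and covariance in the example are the standard Normal-percentile case with a hypothetical sample size, used for illustration only:

    import numpy as np
    from scipy import stats

    def percentile_ci(xp, grad, C, alpha=0.05, positive=False):
        """Confidence limits for a percentile x_p by error propagation.

        grad is d(x_p)/d(theta) at the MLE and C is the parameter
        covariance matrix; positive=True applies the log-scale limits.
        """
        s = np.sqrt(grad @ C @ grad)          # sigma_{x_p}
        z = stats.norm.ppf(1 - alpha / 2)     # Z_alpha
        if positive:
            return np.exp(np.log(xp) - z * s / xp), np.exp(np.log(xp) + z * s / xp)
        return xp - z * s, xp + z * s

    # Example: Normal percentile x_p = mu + sigma * z_p, so grad = (1, z_p)
    mu, sigma, p, n = 1.0, 0.5, 0.9, 100
    zp = stats.norm.ppf(p)
    C = np.diag([sigma**2 / n, sigma**2 / (2 * n)])  # asymptotic covariance
    print(percentile_ci(mu + sigma * zp, np.array([1.0, zp]), C))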

Reference

  1. R. A. Lockhart and M. A. Stephens (1994). "Estimation and Tests of Fit for the Three-parameter Weibull Distribution". Journal of the Royal Statistical Society, Vol. 56, No. 3, pp. 491-500.
  2. Ralph B. D'Agostino and Michael A. Stephens (Eds.) (1986). Goodness-of-Fit Techniques. New York: Marcel Dekker.
  3. J. A. Hartigan (1975). Clustering Algorithms. Wiley.
  4. W. Stute, W. G. Manteiga and M. P. Quindimil (1993). "Bootstrap based goodness-of-fit-tests". Metrika, Vol. 40, No. 1, pp. 243-256.