ISSN: 2322-0066
Department of Statistics and Applied Mathematics, Federal University of Ceará, Fortaleza, Brazil
Received date: 06/03/2017; Accepted date: 18/04/2017; Published date: 07/04/2017
Research & Reviews: Research Journal of Biology
In 2006, Rathie and Swamee proposed a generalization of the logistic distribution that is more flexible and can be multimodal. This work adds a new parameter to increase the flexibility of that distribution and also derives an asymmetric distribution by the Azzalini method, which adds a further asymmetry parameter. Five data sets (Human Body Fat Index, HIV, Precipitation, pH Concentration, Relative Humidity) are analysed by applying the new distributions. The parameters of the new distributions and of the mixture of normals were estimated by the maximum likelihood method. Because computing the estimates of the new distributions is mathematically demanding, we use iterative numerical methods such as L-BFGS-B, BFGS and SANN, together with an adaptive barrier algorithm that enforces the parameter constraints and an adapted function that searches for the global maximum of a very complex non-linear objective function to supply initial values for the estimation algorithm. All computational work was implemented in the software R. In most cases, we use Hartigan's test to test for unimodality. Goodness of fit is assessed with the Kolmogorov-Smirnov test at a significance level of 5%, and various criteria, such as Mean Square Error, Mean Absolute Deviation and Maximum Deviation, are used to indicate the best fit. The classical and general method for fitting multimodal data is a mixture of distributions, in particular a mixture of normal distributions, because the normal distribution has good mathematical properties. For the mixture of normals, we use the EM algorithm to calculate the estimates. We also use the Akaike Information Criterion and the Bayesian Information Criterion as selection criteria to identify the best distribution, in both cases comparing the new distributions with the mixture of normal distributions to illustrate the applicability of the results derived in this paper.
Rathie-Swamee distribution, Azzalini method, multimodal data set analysis, Akaike information criterion, Bayesian information criterion, maximum likelihood method, Kolmogorov-Smirnov test
Several classical models, such as the normal, exponential, binomial, Poisson and logistic distributions, are available for analyzing different data sets. As there is no single unified model, new models suited to the data sets under consideration have to be constructed. The logistic model is very useful in many areas of statistics and physics. This article is organized as follows: Section 2 deals with the symmetric generalized logistic distribution, whereas Section 3 studies its skew form. Section 4 applies the results of the earlier sections to five real data sets and, where possible, compares them with the mixture of two normal distributions. The article ends with a short conclusion and a list of references.

Rathie et al. [1] defined a multimodal symmetric distribution function G(x) for a random variable X ∼ RS(a, b, p) as
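$$G(x) = \frac{1}{1 + \exp\{-x\,(a + b|x|^{p})\}}, \quad x \in \mathbb{R}, \tag{1}$$

with a and b not both zero. For b = 0 or p = 0, (1) reduces to the logistic distribution

$$G(x) = \frac{1}{1 + \exp(-cx)}, \tag{2}$$

where c = a (when b = 0) or c = a + b (when p = 0). The density function corresponding to (1) is

$$g(x) = \frac{[a + b(1+p)|x|^{p}]\,\exp\{-x\,(a + b|x|^{p})\}}{[1 + \exp\{-x\,(a + b|x|^{p})\}]^{2}}. \tag{3}$$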
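As a quick numerical check of the distribution function (1) and the density (3), the following Python sketch evaluates both. It assumes the Rathie-Swamee form G(x) = 1/(1 + exp{-x(a + b|x|^p)}); the function names `rs_z`, `rs_cdf` and `rs_pdf` and the parameter values are illustrative choices, not taken from the paper, whose computations were done in R.

```python
import math

def rs_z(x, a, b, p):
    """Exponent z(x) = x * (a + b*|x|^p) of the assumed RS form."""
    return x * (a + b * abs(x) ** p)

def rs_cdf(x, a, b, p):
    """Distribution function G(x) = 1 / (1 + exp(-z(x))), evaluated stably."""
    z = rs_z(x, a, b, p)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def rs_pdf(x, a, b, p):
    """Density g(x) = z'(x) * exp(-z) / (1 + exp(-z))^2, with z'(x) = a + b(1+p)|x|^p."""
    z = rs_z(x, a, b, p)
    dz = a + b * (1.0 + p) * abs(x) ** p
    e = math.exp(-abs(z))  # exp(-z)/(1+exp(-z))^2 is even in z
    return dz * e / (1.0 + e) ** 2

# Illustrative parameter values (assumed, not from the paper).
a, b, p = 0.5, 0.2, 2.0
x0, step = 0.7, 1e-5
numeric_derivative = (rs_cdf(x0 + step, a, b, p) - rs_cdf(x0 - step, a, b, p)) / (2 * step)
print(numeric_derivative, rs_pdf(x0, a, b, p))
```

The central difference of `rs_cdf` should agree with `rs_pdf` to several decimal places, and setting `b = 0` gives the ordinary logistic distribution with c = a.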
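A symmetric distribution can be generated by the method proposed by Jones in 2004 [2]. Let U ~ Beta(α, α) and X = G^{-1}(U), where G(x) is the distribution function with density g(x). Then the distribution function H(x) of X is given by

$$H(x) = \frac{1}{B(\alpha,\alpha)} \int_{0}^{G(x)} t^{\alpha-1}(1-t)^{\alpha-1}\,dt. \tag{4}$$

Differentiating H(x) yields the corresponding density function

$$h(x) = \frac{g(x)\,[G(x)\{1-G(x)\}]^{\alpha-1}}{B(\alpha,\alpha)}. \tag{5}$$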
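The construction X = G^{-1}(U) with U ~ Beta(α, α) also gives a direct way to simulate from the generalized distribution. The sketch below is a minimal Python illustration, assuming the Rathie-Swamee form G(x) = 1/(1 + exp{-x(a + b|x|^p)}) with a, b ≥ 0 not both zero; the bisection-based `rs_quantile`, the helper `sample_rsg` and the parameter values are hypothetical and are not the authors' R code.

```python
import math
import random

def rs_cdf(x, a, b, p):
    """Assumed RS distribution function, evaluated stably."""
    z = x * (a + b * abs(x) ** p)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def rs_quantile(u, a, b, p, n_iter=100):
    """Invert G by bisection: solve x*(a + b*|x|^p) = logit(u).

    The left-hand side is odd and increasing in x, so we solve for |x|
    and restore the sign afterwards.
    """
    target = math.log(u / (1.0 - u))
    sign = 1.0 if target >= 0 else -1.0
    target = abs(target)
    lo, hi = 0.0, 1.0
    while hi * (a + b * hi ** p) < target:  # expand the bracket
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mid * (a + b * mid ** p) < target:
            lo = mid
        else:
            hi = mid
    return sign * 0.5 * (lo + hi)

def sample_rsg(n, a, b, p, alpha, seed=0):
    """Draw n values of X = G^{-1}(U), U ~ Beta(alpha, alpha)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        u = rng.betavariate(alpha, alpha)
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep logit(u) finite
        draws.append(rs_quantile(u, a, b, p))
    return draws

draws = sample_rsg(2000, a=1.0, b=0.5, p=2.0, alpha=2.0)
sample_mean = sum(draws) / len(draws)
print(sample_mean)
```

Because the distribution is symmetric about zero, the sample mean should be close to zero.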
Using (1) and (3) in (5), the generalized symmetric logistic density function for X ~ RSG(a, b, p, α) is given by

$$h(x) = \frac{[a + b(1+p)|x|^{p}]\,\exp\{-\alpha x\,(a + b|x|^{p})\}}{B(\alpha,\alpha)\,[1 + \exp\{-x\,(a + b|x|^{p})\}]^{2\alpha}}, \quad x \in \mathbb{R}, \tag{6}$$

where a and b are not both zero and B(., .) is the beta function. For α = 1, (6) reduces to (3). We may introduce a location parameter μ in the model (6). There is no need to introduce a scale parameter, as the density function would then become non-identifiable. On introducing the location parameter μ ∈ R, the density function (6) takes the form

$$h_{1}(x) = \frac{[a + b(1+p)|x-\mu|^{p}]\,\exp\{-\alpha (x-\mu)\,(a + b|x-\mu|^{p})\}}{B(\alpha,\alpha)\,[1 + \exp\{-(x-\mu)\,(a + b|x-\mu|^{p})\}]^{2\alpha}}. \tag{7}$$

Figures 1 to 4 show graphs of (6) and (7) for various values of the parameters μ, a, b, p and α.
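To see how the extra parameter α and the location μ act on the density, the following Python sketch evaluates the RSG density in the form h(x) = g(x)[G(x){1 - G(x)}]^{α-1}/B(α, α), again assuming the Rathie-Swamee form of G. The name `rsg_pdf`, the simple trapezoid rule and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def rsg_pdf(x, a, b, p, alpha, mu=0.0):
    """Assumed RSG density with location mu, computed on the log scale."""
    y = x - mu
    z = y * (a + b * abs(y) ** p)
    dz = a + b * (1.0 + p) * abs(y) ** p
    # log of G(1 - G) = -|z| - 2*log(1 + exp(-|z|))
    log_gg = -abs(z) - 2.0 * math.log1p(math.exp(-abs(z)))
    log_beta = 2.0 * math.lgamma(alpha) - math.lgamma(2.0 * alpha)
    return dz * math.exp(alpha * log_gg - log_beta)

def trapezoid(f, lo, hi, n=20000):
    """Composite trapezoid rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

# Illustrative values: a small a and larger b give two modes around zero.
params = dict(a=0.1, b=1.0, p=2.0, alpha=1.5)
area = trapezoid(lambda x: rsg_pdf(x, **params), -6.0, 6.0)
is_bimodal = rsg_pdf(0.8, **params) > rsg_pdf(0.0, **params)
print(area, is_bimodal)
```

The numerical area should be close to one, and for these values the density is higher at x = 0.8 than at the centre, which illustrates the bimodal shapes the model can take.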
Distribution function
In this subsection, we prove that the distribution function corresponding to (6) is given by
8
Proof. For x > 0, we have
9
Substituting, we get
we have
10
By symmetry, we easily write the result for x < 0.
Moments
In this subsection, we obtain the n-th moments about the origin. By definition,
when n is an even integer
11
Then, by expanding the denominator by binomial theorem, we have
12
when n is an even integer.
The variance of X ~ RSG (a, b, p, α) is given by
13
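In the Azzalini density [3]

$$s(x) = 2\,v(x)\,V[w(x)], \quad x \in \mathbb{R}, \tag{14}$$

with w(x) = kx, k ∈ R, take v(x) as the density function of X ~ RSG(a, b, p, α) and V(x) as the distribution function of X ~ RS(a, b, p). Then the density function of the generalized skew logistic model X ~ RSGA(a, b, p, α, k) is given by

$$s(x) = \frac{2\,[a + b(1+p)|x|^{p}]\,\exp\{-\alpha x\,(a + b|x|^{p})\}}{B(\alpha,\alpha)\,[1 + \exp\{-x\,(a + b|x|^{p})\}]^{2\alpha}\,[1 + \exp\{-kx\,(a + b|kx|^{p})\}]}. \tag{15}$$

Introducing the location parameter μ ∈ R, the density function of X ∼ RSGA(a, b, p, α, k) is given by

$$s_{1}(x) = s(x - \mu). \tag{16}$$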
For certain values of the parameters, s(x) and s1(x) are plotted in Figure 5 for k =±0.7 and in Figures 6 and 7 for a=0 and b=0 respectively.
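The Azzalini construction s(x) = 2 v(x) V[w(x)] with w(x) = kx can be checked numerically. The Python sketch below assumes the Rathie-Swamee form G(x) = 1/(1 + exp{-x(a + b|x|^p)}) for V and the corresponding RSG density for v; the function names and parameter values (apart from k = 0.7, which echoes the value plotted in Figure 5) are illustrative assumptions.

```python
import math

def rs_cdf(y, a, b, p):
    """Assumed RS distribution function V, evaluated stably."""
    z = y * (a + b * abs(y) ** p)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def rsg_pdf(y, a, b, p, alpha):
    """Assumed symmetric RSG density v."""
    z = y * (a + b * abs(y) ** p)
    dz = a + b * (1.0 + p) * abs(y) ** p
    log_gg = -abs(z) - 2.0 * math.log1p(math.exp(-abs(z)))
    log_beta = 2.0 * math.lgamma(alpha) - math.lgamma(2.0 * alpha)
    return dz * math.exp(alpha * log_gg - log_beta)

def rsga_pdf(x, a, b, p, alpha, k, mu=0.0):
    """Skew density s(x) = 2 * v(x - mu) * V(k * (x - mu))."""
    y = x - mu
    return 2.0 * rsg_pdf(y, a, b, p, alpha) * rs_cdf(k * y, a, b, p)

def trapezoid(f, lo, hi, n=20000):
    """Composite trapezoid rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

a, b, p, alpha, k = 1.0, 0.5, 1.5, 2.0, 0.7
area = trapezoid(lambda x: rsga_pdf(x, a, b, p, alpha, k), -10.0, 10.0)
print(area, rsga_pdf(1.0, a, b, p, alpha, k), rsga_pdf(-1.0, a, b, p, alpha, k))
```

Since V(kx) + V(-kx) = 1, the skew density satisfies s(x) + s(-x) = 2 v(x), so it still integrates to one; k = 0 recovers the symmetric density and k > 0 shifts mass to the right.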
In the present section, five data sets are analyzed using the distributions defined in the earlier sections as well as the mixture of two normals for bimodal data. The parameters are estimated by the method of maximum likelihood. The Akaike Information Criterion [4], Bayesian Information Criterion, Mean Square Error, Mean Absolute Deviation and Maximum Absolute Deviation are calculated to judge the fit of the RSG, RSGA and mixture-of-two-normals models. The Kolmogorov-Smirnov goodness-of-fit test is used at a significance level of 5%. Several packages of the software R are used. The GenSA package [5] is used to obtain initial values for the iterative algorithm. For the iterative algorithm, we use the mle2 function of the bbmle package [6], in most cases with the BFGS method and the optimizer constrOptim, to guarantee that the estimated parameters stay within their respective parameter spaces. For more details on the adaptive barrier algorithm, see stats::constrOptim in R. We obtain the estimates of the parameters, approximate standard errors of the estimates based on a quadratic approximation to the curvature at the maximum likelihood estimate, and a test (z test) of the difference of each parameter from zero based on this standard error and on the assumption that the sampling distribution of the estimated parameters is normal.
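The two-stage estimation described above (a global search for starting values, followed by constrained local optimization) can be illustrated in outline. The Python sketch below is not the authors' R workflow: it only writes down the RSG log-likelihood with a location parameter, assuming the Rathie-Swamee form of the density, and uses a crude uniform random search inside box constraints as a stand-in for GenSA. The local refinement step (BFGS with an adaptive barrier in the paper) is not shown. The names `rsg_loglik` and `random_search_start`, the bounds and the simulated data are all hypothetical.

```python
import math
import random

def rsg_loglik(theta, data):
    """Log-likelihood of the assumed RSG model, theta = (mu, a, b, p, alpha)."""
    mu, a, b, p, alpha = theta
    log_beta = 2.0 * math.lgamma(alpha) - math.lgamma(2.0 * alpha)
    total = 0.0
    for x in data:
        y = x - mu
        z = y * (a + b * abs(y) ** p)
        dz = a + b * (1.0 + p) * abs(y) ** p
        log_gg = -abs(z) - 2.0 * math.log1p(math.exp(-abs(z)))
        total += math.log(dz) + alpha * log_gg - log_beta
    return total

def random_search_start(data, bounds, n_draws=500, seed=0):
    """Crude global search: keep the best of n_draws uniform candidates."""
    rng = random.Random(seed)
    best_theta, best_ll = None, -math.inf
    for _ in range(n_draws):
        theta = [rng.uniform(lo, hi) for lo, hi in bounds]
        ll = rsg_loglik(theta, data)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta, best_ll

# Simulated logistic data with location 2 (RSG with a = 1, b = 0, alpha = 1).
rng = random.Random(42)
data = []
for _ in range(300):
    u = rng.uniform(1e-12, 1.0 - 1e-12)
    data.append(2.0 + math.log(u / (1.0 - u)))

# Box constraints keep a > 0 and alpha > 0, so every candidate is admissible.
bounds = [(min(data), max(data)), (0.05, 3.0), (0.0, 2.0), (0.1, 4.0), (0.1, 4.0)]
start, start_ll = random_search_start(data, bounds)
print(start, start_ll)
```

The returned `start` would then be passed to a constrained local optimizer, which is the role played by mle2 with constrOptim in the paper.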
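The AIC and BIC are used to rank the model fits to the data sets in the various applications. They are defined as

$$\mathrm{AIC} = 2\,n_{par} - 2\,l(\hat{\theta}; x), \tag{17}$$

where n_par is the number of parameters to be estimated and l(.;.) is the logarithm of the estimated likelihood function, and

$$\mathrm{BIC} = n_{par}\,\log(n) - 2\,l(\hat{\theta}; x), \tag{18}$$

where n is the number of observations. Mean Square Error (MSE), Mean Absolute Deviation (MAD) and Maximum Absolute Deviation (MD) are defined as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left[F_{n}(x_{i}) - \hat{F}(x_{i})\right]^{2}, \quad \mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|F_{n}(x_{i}) - \hat{F}(x_{i})\right|, \quad \mathrm{MD} = \max_{1 \le i \le n}\left|F_{n}(x_{i}) - \hat{F}(x_{i})\right|,$$

where F_n is the empirical cumulative distribution and F̂ is the fitted cumulative distribution of the data. For each criterion, a smaller value indicates a better fit.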
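The three deviation measures compare the empirical and fitted distribution functions at the observed points. The following Python sketch shows one way to compute them, under the assumption that the empirical distribution is evaluated as F_n(x_(i)) = i/n at the i-th order statistic; the paper's exact convention is not recoverable from the text, and the function name `fit_deviations` and the toy data are hypothetical.

```python
def fit_deviations(data, fitted_cdf):
    """Return (MSE, MAD, MD) between the empirical and a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    # |F_n(x_(i)) - F_hat(x_(i))| with F_n(x_(i)) = i / n (assumed convention)
    diffs = [abs((i + 1) / n - fitted_cdf(x)) for i, x in enumerate(xs)]
    mse = sum(d * d for d in diffs) / n
    mad = sum(diffs) / n
    md = max(diffs)
    return mse, mad, md

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1], used as a toy fitted model."""
    return min(1.0, max(0.0, x))

mse, mad, md = fit_deviations([0.9, 0.2, 0.5], uniform_cdf)
print(mse, mad, md)
```

For the three toy observations, the absolute differences are 2/15, 1/6 and 1/10, so MD is 1/6 and MAD is 2/15.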
Human body fat index
The data consist of 252 observations on 17 variables about human body fat. For details, see Johnson [7], Penrose et al. [8], and Ambler et al. [9]. Figure 8 indicates that the data are unimodal, which is also confirmed by Hartigan's test [10,11] with statistic D=0.014114 and p-value near 1. The estimates of the parameters using the RSG and RSGA models are given in Table 1.
RSG Parameter | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 19.26 | 2.1087 × 10^-5 | 9.1336 × 10^5 | <0.0001 |
a | 0.15401 | 1.127 × 10^-2 | 13.662 | <0.0001 |
b | 10^-4 | 3.3937 × 10^-5 | 2.9467 | <0.004 |
p | 2.1986 | 1.0742 × 10^-4 | 2.0468 × 10^4 | <0.0001 |
α | 1.2338 | 8.1766 × 10^-4 | 1.5089 × 10^3 | <0.0001 |
log L | −890.9885 | | | |

RSGA Parameter | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 7.8768 | 1.0392 × 10^-2 | 757.9289 | <0.0001 |
a | 0.18403 | 2.8006 × 10^-2 | 6.5712 | <0.0001 |
b | 10^-4 | 3.0035 × 10^-5 | 3.3294 | <0.0001 |
p | 2.2996 | 1.9703 × 10^-3 | 1167.086 | <0.0001 |
α | 0.35062 | 7.2149 × 10^-2 | 4.8597 | <0.0001 |
k | 1.7177 | 2.8455 × 10^-2 | 60.3678 | <0.0001 |
log L | −889.786 | | | |
Table 1: Estimates associated with RSG and RSGA models.
Table 2 shows the comparison of the models used. Figure 8 presents the histogram with the fitted models. The empirical and theoretical distributions are shown in Figure 9.
Model | K-S | p-value | MSE (×10^-4) | MAD | MD | AIC | BIC |
---|---|---|---|---|---|---|---|
RSG | 0.047619 | 0.9375 | 1.315639 | 0.009163 | 0.033421 | 1791.977 | 1809.624 |
RSGA | 0.06746 | 0.615 | 1.189378 | 0.008951 | 0.030355 | 1791.572 | 1812.749 |
Table 2: The comparison of adjusted models used.
By AIC, the RSGA fit is slightly better than the RSG fit for this data set, whereas the Bayesian criterion indicates a better fit for the RSG distribution.
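The AIC and BIC values in Table 2 can be reproduced from the log-likelihoods in Table 1 with the standard definitions AIC = 2 n_par - 2 log L and BIC = n_par log(n) - 2 log L. The Python sketch below does this for the body fat data (n = 252); the function names are illustrative.

```python
import math

def aic(log_lik, n_par):
    """Akaike Information Criterion."""
    return 2.0 * n_par - 2.0 * log_lik

def bic(log_lik, n_par, n_obs):
    """Bayesian Information Criterion."""
    return n_par * math.log(n_obs) - 2.0 * log_lik

n_obs = 252
# (log-likelihood from Table 1, number of estimated parameters)
models = {"RSG": (-890.9885, 5), "RSGA": (-889.786, 6)}
results = {
    name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()
}
for name, (a_val, b_val) in results.items():
    print(name, round(a_val, 3), round(b_val, 3))
```

The extra parameter of the RSGA model raises its log-likelihood enough to lower the AIC slightly, but the heavier log(n) penalty of the BIC reverses the ranking.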
Precipitation
The data consist of 121 observations about annual precipitation (rain) between 1978 and 1998 at the center of the city of Los Angeles. These data were obtained from the site [12]. Figure 10 indicates that the data are unimodal, which is also confirmed by Hartigan's test with statistic D=0.027273 and p-value equal to 0.7971. The estimates of the parameters, using the RSGA distribution, are given in Table 3.
Parameter | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 4.0393 | 4.4968 × 10^-2 | 89.825 | <0.0001 |
a | 49.999 | 2.6007 × 10^-4 | 1.9225 × 10^5 | <0.0001 |
b | 34.113 | 3.9072 × 10^-4 | 8.7308 × 10^4 | <0.0001 |
p | 0.7582 | 0.1095 | 6.9239 | <0.0001 |
α | 2.9333 × 10^-4 | 8.0064 × 10^-5 | 3.6638 | <0.0003 |
k | 3.838 | 2.6556 × 10^-4 | 1.4452 × 10^4 | <0.0001 |
log L | −393.2849 | | | |
Table 3: Estimates associated with RSGA model.
Applying the non-parametric Kolmogorov-Smirnov test, the K-S value obtained is 0.07438 with p-value 0.8914, so we do not reject the hypothesis that the data follow the RSGA distribution. In 2014, Eirado et al. [13] proposed an asymmetric model and applied it to this data set. For that model, the MSE is 0.001058396, the mean absolute deviation (MAD) is 0.02785116 and the maximum absolute deviation (MD) is 0.06496284. For the RSGA model, we obtained an MSE of 0.0002414233, a MAD of 0.01185483 and an MD of 0.04669135.
The AIC and BIC of the fits of the two models are given in Table 4. The empirical and theoretical distributions are shown in Figure 11. Clearly, the RSGA distribution gives the better fit to the precipitation data.
Model | log-likelihood | AIC | BIC |
---|---|---|---|
RSGA | −393.2849 | 798.5697 | 815.3444 |
Eirado-Rathie | −551.6425 | 1113.285 | 1127.264 |
Table 4: The comparison of the models.
HIV Data
The HIV data with 2843 observations is available in fitdistrplus: Aids2 package of software R, giving the age when a patient is diagnosed with AIDS in Australia in 1991. Table 5 presents the estimates of the parameters of RSG and RSGA models.
RSG | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 36.931 | 0.18698 | 197.51 | <0.0001 |
a | 0.16731 | 0.017989 | 9.3006 | <0.0001 |
p | 8.9282 | 5.2278 × 10^-17 | 1.7078 × 10^17 | <0.0001 |
α | 1.1148 | 0.017463 | 6.3838 | <0.0001 |
log L | −10552.23 | | | |

RSGA | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 27.477 | 0.031826 | 86.336 | <0.0001 |
a | 0.05717 | 0.001222 | 46.779 | <0.0001 |
p | 9.7371 | 1.0564 × 10^-15 | 9.2174 × 10^15 | <0.0001 |
α | 3.5391 | 0.20708 | 17.091 | <0.0001 |
k | 4.5317 | 0.17192 | 26.359 | <0.0001 |
log L | −10508.95 | | | |
Table 5: Estimates associated with RSGA and RSG models.
NORSKEW | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 37.5304 | 0.187355 | 200.317 | <0.0001 |
σ | 10.01696 | 0.13529 | 74.041 | <0.0001 |
ξ | 1.273675 | 0.031561 | 40.355 | <0.0001 |
log L | −10549.26 | | | |

NORMAL | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 37.40907 | 0.1887 | 198.245 | <0.0001 |
σ | 10.06149 | 0.13343 | 75.406 | <0.0001 |
log L | −10597.72 | | | |
Table 6: Estimates associated with asymmetric normal and normal distributions.
The histogram and the fitted RSGA distribution for the HIV data are shown in Figure 12, and the empirical and RSGA distribution functions in Figure 13. In Table 7, the Kolmogorov-Smirnov test rejects all fitted distributions except the RSGA distribution.
Model | K-S | p-value | MSE (×10^-4) | MAD | MD | AIC | BIC |
---|---|---|---|---|---|---|---|
RSG | 0.289524 | 0.0014 | 4.376593 | 0.017052 | 0.04955 | 21112.47 | 21136.28 |
RSGA | 0.063492 | 0.69 | 1.450326 | 0.009691 | 0.032249 | 21027.9 | 21057.66 |
NORSKEW | 0.041857 | 0.01373 | 3.033533 | 0.014451 | 0.040824 | 21104.53 | 21122.39 |
NORMAL | 0.059796 | 7.696 × 10−5 | 8.539093 | 0.025396 | 0.058367 | 21199.44 | 21211.35 |
Table 7: Comparison of the models used.
pH Concentration data
The pH concentration data [14] with 252 observations show bimodality which is also demonstrated by Hartigan’s test with statistics of the test equal to 0.046498 and p-value of 0.00045. The estimates of the parameters are given in Table 8.
RSGA | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 3.094726 | 0.071289 | 43.4109 | <0.0001 |
a | 8.242063 | 2.241954 | 3.6763 | <0.0003 |
b | 0.003 | 0.001066 | 2.8153 | 0.004874 |
p | 6.244648 | 0.344886 | 18.1064 | <0.0001 |
α | 0.045077 | 0.011673 | 3.8616 | <0.0002 |
k | 0.86603 | 0.335523 | 2.5811 | 0.009848 |
log L | −364.2 | | | |

RSG | Estimate | Error | z-value | P(z) |
---|---|---|---|---|
µ | 4.918676 | 0.042907 | 114.6364 | <0.0001 |
a | 6.027683 | 0.616692 | 9.7742 | <0.0001 |
b | 2.906972 | 1.071397 | 2.7133 | <0.007 |
p | 2.711035 | 0.459798 | 5.8961 | <0.0001 |
α | 0.068114 | 0.006893 | 9.8812 | <0.0001 |
log L | −363.7172 | | | |
Table 8: Estimates associated with RSGA and RSG models.
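The z-values and P(z) columns follow from the Wald test described earlier: z is the estimate divided by its standard error, and the two-sided p-value is 2[1 - Φ(|z|)] under the normality assumption. The Python sketch below recomputes two rows of the RSGA block of Table 8 from the rounded estimates and errors; the function name `wald_z_test` is illustrative, and small differences from the table are expected because the published inputs are rounded.

```python
import math

def wald_z_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

z_b, p_b = wald_z_test(0.003, 0.001066)      # row b of the RSGA block
z_k, p_k = wald_z_test(0.86603, 0.335523)    # row k of the RSGA block
print(round(z_b, 4), round(p_b, 6))
print(round(z_k, 4), round(p_k, 6))
```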
Silva et al. [15] proposed two new asymmetric models, h1(x) and h2(x), by Azzalini's method and fitted them to the pH concentration data. Table 10 shows the performance of the fitted distributions.
Using package of Benaglia et al. [16], the estimates of mixture of normals are given in Table 9 with parametric bootstrap performed for standard error approximation.
Parameters | Component 1 | Component 2 | Error of Component 1 | Error of Component 2 |
---|---|---|---|---|
λ | 0.50439 | 0.49561 | 0.041677 | 0.0416768 |
µ | 3.892103 | 5.961384 | 0.076694 | 0.07539492 |
σ | 0.575443 | 0.568638 | 0.056243 | 0.05409495 |
log L | −366.8661 |
Table 9: Estimates of the mixture of two normals.
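The mixture estimates are obtained with the EM algorithm, which alternates between computing each observation's membership probability (E-step) and updating the weight, means and standard deviations (M-step). The Python sketch below is a minimal pure-Python version for two normal components, not the R package used in the paper; the function name `em_two_normals`, the quartile-based starting values and the simulated data (whose means and standard deviation loosely echo Table 9) are assumptions for illustration only.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_normals(data, n_iter=200):
    """Fit lam*N(mu1, s1^2) + (1 - lam)*N(mu2, s2^2) by EM."""
    xs = sorted(data)
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    # Starting values: quartiles for the means, half the overall sd for both.
    lam, mu1, mu2, s1, s2 = 0.5, xs[n // 4], xs[3 * n // 4], sd / 2.0, sd / 2.0
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in xs:
            p1 = lam * normal_pdf(x, mu1, s1)
            p2 = (1.0 - lam) * normal_pdf(x, mu2, s2)
            resp.append(p1 / (p1 + p2))
        # M-step: weighted updates of the parameters.
        w1 = sum(resp)
        w2 = n - w1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / w1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / w2
        s1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / w1)
        s2 = math.sqrt(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / w2)
        lam = w1 / n
    log_lik = sum(
        math.log(lam * normal_pdf(x, mu1, s1) + (1.0 - lam) * normal_pdf(x, mu2, s2))
        for x in xs
    )
    return lam, (mu1, s1), (mu2, s2), log_lik

# Simulated bimodal data (not the pH data set).
rng = random.Random(1)
sim = [rng.gauss(3.9, 0.57) for _ in range(1000)] + [rng.gauss(5.96, 0.57) for _ in range(1000)]
lam_hat, comp1, comp2, ll = em_two_normals(sim)
print(lam_hat, comp1, comp2, ll)
```

With well-separated components, as here, the recovered means and mixing weight should be close to the values used in the simulation.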
The histogram of the pH values along with the fitted distributions is shown in Figures 14 and 15.
Table 10 gives the values of AIC, BIC, MSE, etc., for the various models. The RSG model fits the bimodal data well.
Model | K-S | p-value | MSE (×10^-4) | MAD | MD | AIC | BIC |
---|---|---|---|---|---|---|---|
RSG | 0.06746 | 0.61 | 1.814886 | 0.01067501 | 0.039083 | 737.4343 | 755.0871 |
RSGA | 0.075397 | 0.4709 | 2.546568 | 0.01283771 | 0.038684 | 740.4067 | 761.5833 |
NORMIX | 0.083333 | 0.3457 | 7.407901 | 0.02202145 | 0.064505 | 743.7322 | 761.3793 |
h1(x) | – | 0.8316 | 3 | 0.0152 | 0.0373 | 744.6913 | 776.4561 |
h2(x) | – | 0.09438 | 96 | 0.0912 | 0.1454 | 857.387 | 889.1519 |
Table 10: Comparison of the models used.
Relative Humidity (RH)
The RH observations data are taken from Nychka et al. [17]. The estimates of the parameters for RH data using the RSGA model are given in Table 11.
Parameter | Estimate | Error | z-value | P (z) |
---|---|---|---|---|
µ | 59.72236 | 0.008989 | 6643.879 | <0.0001 |
a | 0.034228 | 0.016025 | 2.1359 | <0.04 |
b | 0.002588 | 0.001281 | 2.0199 | <0.05 |
p | 1.227392 | 0.151744 | 8.0886 | <0.0001 |
α | 0.266291 | 0.115667 | 2.3022 | <0.03 |
k | −0.4621166 | 0.095596 | −4.8341 | <0.0001 |
Table 11: Estimation of the parameters of the RSGA model.
The estimates for a mixture of two normals are given in Table 12. The values of AIC, BIC, etc., measuring the quality of fit, are given in Table 13.
NORMIX | Component 1 | Component 2 | Error Component 1 | Error Component 2 |
---|---|---|---|---|
λ | 0.6975 | 0.3025 | 0.025634 | 0.02563423 |
µ | 36.8122 | 77.08626 | 0.865337 | 1.139471 |
σ | 11.835 | 9.28641 | 0.648855 | 0.8474422 |
log L | −1958.626 |
Table 12: Estimation of the parameters of the mixture of two normals.
Model | K-S | p-value | MSE (×10^-4) | MAD | MD | AIC | BIC |
---|---|---|---|---|---|---|---|
RSGA | 0.080178 | 0.1115 | 8.497387 | 0.0217099 | 0.076005 | 3926.544 | 3951.182 |
NORMIX | 0.073497 | 0.1768 | 9.316102 | 0.02176488 | 0.066207 | 3927.252 | 3947.787 |
Table 13: Comparison of the models used.
Figure 16 shows the histogram together with the empirical density and the fitted RSGA and mixture-of-two-normals distributions. Figure 17 shows the empirical and theoretical distributions.
The Rathie-Swamee generalized distribution (RSG) and its skew form (RSGA) proved useful for the five data sets analyzed, thus demonstrating their applicability relative to the mixture of two normals in the case of the bimodal sets (pH concentration and relative humidity).
P. N. Rathie thanks the Coordination for the Improvement of Higher Level Personnel (CAPES) for supporting his Senior National Visiting Professorship.