Classifying Exoplanets with Gaussian Mixture Model

Recently, Odrzywolek and Rafelski (arXiv:1612.03556) have found three distinct categories of exoplanets, when they are classified based on density. We first carry out a similar classification of exoplanets according to their density using the Gaussian Mixture Model, followed by information theoretic criterion (AIC and BIC) to determine the optimum number of components. Such a one-dimensional classification favors two components using AIC and three using BIC, but the statistical significance from both the tests is not significant enough to decisively pick the best model between two and three components. We then extend this GMM-based classification to two dimensions by using both the density and the Earth similarity index (arXiv:1702.03678), which is a measure of how similar each planet is compared to the Earth. For this two-dimensional classification, both AIC and BIC provide decisive evidence in favor of three components.


INTRODUCTION
Over the past two decades, there has been a revolution in the field of exoplanet astronomy following the confirmation of more than 3000 planets orbiting stars other than the Sun (see Ref. [3] for a recent review and Ref. [4] for a summary of exoplanet detection techniques). Lot of work has been done to characterize the properties of the exoplanets discovered using the myriad techniques [5]. Recently, Odrzywolek and Rafelski [1] (hereafter OR16) have carried out the classification of exoplanets according to their density, following a suggestion long time back [6]. OR16 fitted the exoplanet density data to lognormal distributions to determine the optimum number of components. They found three lognormal components with peak densities at 0.71 gm/cm 3 , 6.9 gm/cm 3 , and 29.1 gm/cm 3 [1]. These three components correspond to ice/gas giants, iron/rock super-Earths, and brown dwarfs respectively. The optimum number of components was determined by maximizing the log-likelihood and then checking the goodness of fit for different number of components by calculating the p-value from three distinct non-parametric tests.
We would like to do a variant of the above analysis by carrying out a similar classification according to density using Gaussian mixture models, followed by information theoretic criterion to determine the optimum number of classes. We have previously used this procedure, to perform a unified classification of all GRB datasets using three different model comparison techniques [7]. We then extend the analysis of OR16 by considering two-dimensional classification using both the density and Earth similarity index. This paper is organized as follows. In Section 2, we describe the dataset and the physical quantities used for the classification. The mathematical basis for the classification is discussed in Section 3. Our results are shown in Section 4 and we conclude in Section 5.

Exoplanet Catalog
In the one-dimensional classification, we shall study the trends obtained from the confirmed detections, by classifying the exoplanet database according to their densities, for which we need the mass and radius of the planets. We obtain the mass and radius information from the catalogs uploaded on the NASA Exoplanet archive 1 and the Extrasolar planet encyclopedia 2 as of February 18, 2017. From these datasets, we consider only those planets with measured values of mass and density, and which exist in both the datasets with the same observed values to avoid any irregularities and to maintain consistency in the dataset. The NASA Exoplanet archive is a NASA funded public data service, which is hosted by the Infrared Processing and Analysis Center. This catalog lists only those objects, for which their detection and planetary status is sacrosanct. As of Feb 18, 2017, it contained a total of 3440 planets out of which 531 have measured mass and radius values detected. Most of the planets listed in this catalog have been detected using transit photometry. The Extrasolar planet encyclopedia is maintained by the Meudon Observatory in Paris and as of Feb 18, 2017 contained total of 3567 planets (most of which were also detected using transit photometry), of which 615 have measured values for all the parameters. The data provided by the two catalogs is similar except for some differences in their selection criteria. The Extrasolar planet encyclopedia allows planets weighing from 60 Jupiter Mass onwards, whereas the NASA Exoplanet archive uses 30 Jupiter mass as the lower limit, which is also the reason for the smaller number of confirmed exoplanets in the latter. However, one caveat is that the catalog is continuously updated and sometimes false detections are removed as the data gets subjected to more scrutiny. Therefore, in order to obtain a gold sample, we have selected 450 observations, which are common to both the datasets for our study in this 1 https://exoplanetarchive.ipac.caltech.edu/ 2 http://exoplanet.eu arXiv:1708.00605v2 [astro-ph.IM] 1 Jun 2018 paper. Both the datasets used for this analysis as well as the code which looks for common planets between the two catalogs have been uploaded on github and can be found at https://github.com/IITH/Exoplanet-Classification In addition to the one-dimensional classification using only density, we also carry out a two-dimensional classification, wherein we use both the density and the Earth Similarity Index (or ESI) [2] for the classification. For this, we need some additional parameters for the calculation of ESI. The additional parameters that we need apart from the radius and density are the surface temperature and time period of revolution, as other parameters can be derived from the mass and radius. The escape velocity and surface gravity are calculated by positing that the shape of the planet is a perfect sphere, wherein the total mass is distributed uniformly throughout the volume. We only consider planets for which we have the observed values for all four of these parameters.

Calculations for the data:
Assuming the planet is a perfect sphere with a uniform mass distribution, the expression for density is: The escape velocity is given by: The surface gravity is obtained from: where G is the Gravitational Constant, M is the mass of the planet and R is the radius.
ESI is a figure of merit used to ascertain how habitable is the planet for life to develop compared to the Earth. More details on the theory behind ESI can be found in the work by Kashyap [2], which in turn follows the prescription from Schulze-Makuch et al. [8] (See also [9] for alternate indices proposed similar in spirit to ESI). The ESI is based on six different parameters, viz. density, radius, temperature, surface gravity, escape velocity, and the time period of revolution around their Sun. All these parameters are normalized to Earth units, as it is convenient for the index calculation. The ESI is calculated based on the Bray-Curtis Similarity index [10] and is given by: where x is the parameter for which the index has to be calculated, x 0 is the reference values which in our case is one, as we have expressed all parameters in Earth units and w is the weight exponent. The total ESI is given by: The values of ESI range from 0 (completely different from Earth) to 1 (resembling a clone of Earth).

ANALYSIS METHODS:
We outline the method used for both the one-dimensional classification using density and the two-dimensional classification using density and ESI. For finding the best-fit parameters, we use the Gaussian-mixture Model (GMM) [11], which is part of the Scikit-learn package, used for a variety of machine learning applications in python. The GMM fits the data to a mixture of multiple (k) lognormal Gaussian distributions, which are characterized by their mean, covariance and their respective weights in the fit data. The GMM method uses the Expectation Maximization (EM) algorithm [12] to maximize the likelihood function over the given parameter space. The GMM method can also be generalized to include error bars and this generalized GMM algorithm is referred to in the astrophysics literature as Extreme Deconvolution [13]. However, since we are using a planet catalog measured in two separate datasets having negligible error bars, we stick to the ordinary GMM method. Given the probability distribution function f (x, θ), where x are the observed datapoints, θ are the parameters used to define the function, N being the total number of exoplanets in our study, and w k denotes the weights associated with each of the k log normal distributions, the likelihood can be defined as: and the probability distribution function for a univariate Gaussian as: A generalized bivariate Gaussian distribution can be defined as: where ρ = V12 σ1σ2 is the correlation, V is the covariance of the two variables and µ x is the mean log(density) and µ y is the mean ESI. An additional condition being used in the EM algorithm is the normalization condition: In this study, we use the GMM method for the k = 2 and k = 3 lognormal fits to the data followed by information theory based model comparison methods to assess the best fit amongst these two models.

Model Comparison
Once we have obtained the best-fit parameters for each model, we need to select the optimum model from all the possibilities being considered. Naively, the simplest way to do model comparison would be by carrying out likelihood comparison between the competing models and choosing the model with the highest likelihood as the best model. However, the maximization of likelihood could lead to an overfitting of the model to the data with additional degrees of freedom and hence we need a more robust and accurate criterion, which will penalize the use of extra free parameters. This can be done by using the Information criterion tests, such as Akaike Information Criteria (AIC) and the Bayesian Information Criteria (BIC), which are commonly used in Astrophysics literature [14,15,16,17,18] (and references therein). These information criteria-based methods provide a way to penalize the excess free parameters and determine the best model accordingly.

AIC:
Akaike Information Criteria or AIC [19] penalizes lightly the excess free parameters and is defined as: where p is the number of free parameters in the model and L is the likelihood. The AIC defined in Eq. 10 is valid when the ratio N/p is very large i.e. > 40. For a ratio less than this, a first order correction is included and the modified expression is given by: Throughout our data, the ratio is greater than the value prescribed and hence we do not account for this correction in our study. The preferred model is the one with a lower value of AIC and the efficacy of this hypothesis is determined using the quantity: where ∆AIC i value corresponds to the preference of the model i over the model with the lower AIC value and hence is the null hypothesis. The confidence in the model can be determined by the magnitude of the ∆AIC value. Although one cannot formally calculate p-values from ∆AIC, one usually uses qualitative strength of evidence rules to judge the efficacy of a given model [14,18,20]. As pointed out by Liddle [18], the value for the best model will be, ∆AIC i = 0. Now, if 0 < ∆AIC i < 2, then we can say that we have a weak or no statistical evidence to reject the i th model over the null hypothesis. 2 < ∆AIC i < 6 implies that the model has only weak support and has evidence against this model. For models with ∆AIC i > 6 there exists strong evidence against the model and ∆AIC i > 10 implies a very strong or decisive evidence against the i th model. These rules can be applied directly for the BIC criterion (next subsection) as well.

BIC:
Bayesian Information Criterion or BIC was used by Schwarz [21] and is used to penalize the free parameters much more harshly than the AIC criterion and is defined as: Again, the preferred model is the one with the lower values of BIC and is taken as the null hypothesis for further determining the significance of different models.
Similar to the significance test for the AIC criterion, the ∆BIC i value acts as the significance measure for the BIC test and follows the same values as for AIC. The only difference being that according to BIC criterion, the penalty for a model with extra number of free parameters is harsher compared to AIC.

RESULTS:
4.1. 1D classification We first describe our results for the one-dimensional classification using only the density. We apply the techniques and methods described in the previous sections to the exoplanet catalog, generated by filtering the data from the NASA Exoplanet archive and the Extrasolar Planet encyclopedia as mentioned earlier. For the density function, we find the best-fit model parameters for k lognormal distributions according to the density from Eq. 6. Each distribution is characterized by its mean, standard deviation and the weight of the distribution indicating the number of planets that have been classified under that particular distribution. We apply the GMM routine to the density functions after varying the number of Gaussians from 1 to 14.   The GMM based fit for the density of the exoplanets using the best-fit parameters from Eq. 6 for k = 3. Details of the fit can be found in Table 1.

TABLE 1
Model Comparison parameters for exoplanets, when classified according to density. The first column denotes the number of components used for the fitting. The second column is the best-fit value for the logarithm of the density (expressed in gm/cm 3 ), the third column is its standard deviation, and the fourth column indicates the number of planets classified in each category. The next two columns indicate the AIC and BIC value for both the hypothesis. The last two columns indicate the absolute value of ∆ (AIC/BIC) between the model with minimum value and the next highest value. As we can see, AIC prefers 3 components, whereas BIC prefers 2. However, ∆AIC and ∆BIC are both < 10 and hence their significance (compared to the value with higher AIC/BIC) is marginal. The scatter plot in Fig. 1 shows all the selected planets for the study as a function of their mass and radius. The distribution looks clustered in certain areas with lots of outliers. The density distribution of 450 exoplanets with their histograms can be found in Fig. 2 and Fig. 3 for the 2-Gaussian and 3-Gaussian fits respectively and we can see intuitively that no difference can be discerned by eye from the two figures. Both the models fit well the distribution of the density function, and hence we have to rely on quantitative model comparison tests that have been carried out on the data, viz. the AIC and the BIC test. As seen in Fig. 4, the BIC test indicates that the 2-component model is the optimum model as it has the minimum BIC value followed by the 3-component model, which has a larger value than the two component model. This trend is different from the AIC test, as the AIC has a minimum for three Gaussians, indicating that this is the best model, followed by the two-component model. These results if compared to the previous attempts at one-dimensional classification done by OR16 are very similar in both, the two Gaussian and the three Gaussian models proposed in this paper. From the 2-component model, the mean density values are at 0.88 gm/cm 3 and 9.69 gm/cm 3 , with each class containing 322 and 128 exoplanets respectively. The inferred mean values of the density for the 3-component model are at 0.71 gm/cm 3 , 2.03 gm/cm 3 and 88.1 gm/cm 3 with 225, 175, and 50 in each of the classes respectively. In the previous study by OR16, the mean density values are at 0.7 gm/cm 3 and 6.3 gm/cm 3 with 320 and 106 respectively in each class for 2 components and at 0.71 gm/cm 3 , 6.9 gm/cm 3 and 29.1 gm/cm 3 with 340, 80, and 7 exoplanets respectively in each class for the 3 components.
From Tab. 1, we see that for the AIC test, the best model preferred is the 3-Gaussian model but there is sufficient confidence shown in both, the 2-Gaussian model and surprisingly the 4-Gaussian model (see Fig. 4) while rejecting all other models by a huge margin. As described in Sect Therefore, the results from the two information criterion tests do not agree. However, ∆AIC and ∆BIC are both less than 10 between the two and three component model, so no one model among these is decisively favored between the two.

2D Classification:
We now proceed to a 2-dimensional GMM based classification using both the logarithm of the density and ESI. We use the combined data from ESI and density using the datasets specified earlier in the manuscript and perform a two-dimensional GMM analysis. We consider only the planets that have measured values for all the quantities required for the calculation of ESI. A total of 450 exoplanets were analyzed for a range of lognormal components. As we can see from Fig. 7, we get a result that is similar to the one we saw in the above case of 1-D classification, where the 3-component component was preferred only with AIC, albeit with marginal significance, using only the density as a parameter. From this two-dimensional analysis using the total ESI along with the density, we have both AIC and BIC preferring the 3-component distribution over all the other ones and by a substantial margin. The best-fit values of the parameters along with their covariance, as well as the ∆AIC and ∆BIC values for the two and three component distributions can be found in Tab. 2 The AIC and BIC tests both point to definitive evidence for one model (three components) and give concordant results. From the statistical confidence measures ∆AIC and ∆BIC, we can assert our confidence in the hypothesis of the three-Gaussian model over all other model fits. We find that for the next preferred models in the analysis, the Figure 6. The scatter plot of the distribution using the two components, log (density) and total ESI. The three ellipses represent the 1σ confidence level region for the 3-component model, which are centered at the means of the distribution acquired from best-fit of Eq. 6 and Eq. 8. ∆AIC = 14 and ∆BIC = 22, which is significant enough to reject the respective models in favor of our null hypothesis with strong confidence.

CONCLUSIONS
In this manuscript, we have undertaken a classification of the exoplanet catalog using clustering based on the logarithm of the planet density (similar to a recent analysis in OR16 [1]), followed by a 2-dimensional analysis using both the log of density and Earth Similarity Index (ESI) [2] for each of the exoplanets. We use Gaussian Mixture Model to classify the data for both the one-dimensional and two-dimensional classifications based on log(density) and {log(density), ESI} respectively. For both of these classifications, we determine the best-fit parameters for each model using the EM algorithm. We then use information theoretic criterion, such as AIC and BIC to determine the optimum number of free parameters. Our results are as follows: 1. For the one-dimensional approach, our analysis does not provide a conclusive evidence between a two-component