Characterizing the Sample Selection for Supernova Cosmology

Type Ia supernovae (SNe Ia) are used as distance indicators to infer the cosmological parameters that specify the expansion history of the universe. Parameter inference depends on the criteria by which the analysis SN sample is selected. Only for the simplest selection criteria and population models can the likelihood be calculated analytically, otherwise it needs to be determined numerically, a process that inherently has error. Numerical errors in the likelihood lead to errors in parameter inference. This article presents toy examples where the distance modulus is inferred given a set of SNe at a single redshift. Parameter estimators and their uncertainties are calculated using Monte Carlo techniques. The relationship between the number of Monte Carlo realizations and numerical errors is presented. The procedure can be applied to more realistic models and used to determine the computational and data management requirements of the transient analysis pipeline.


Introduction
In the early days of supernova cosmology, hard redshift cuts were applied to spectroscopically classified Type Ia supernovae (SNe Ia) in order to mitigate against Malmquist bias and biases due to non-Ia contamination [8,7]. Malmquist bias arises because SN Ia peak magnitudes have an intrinsic dispersion; if in a magnitude-limited survey intrinsically faint undetected SNe are unaccounted for, inferred distances can be underestimated. Most other transients discovered in imaging surveys are intrinsically fainter than SNe Ia, and including these objects in an analysis as if they were SNe Ia would lead to overestimated inferred distances. In the past, strict sample-selection cuts made these sources of bias negligible, though at the expense of reducing the sample size.
A motivation for relaxing the sample-selection criteria is to extend the redshift range of the SN Ia Hubble diagram and to take advantage of the large num-ber of SN discoveries from modern supernova surveys. SNe at higher redshift have fainter observed magnitude, making them more susceptible to Malmquist bias and more difficult to classify spectroscopically; given finite follow-up resources we will have to rely on photometric classification [1,5] in order to consider all discoveries. Analysis models then need to accommodate not only SNe Ia but also a contaminant population [6]. Under these circumstances, the effect of sample selection cannot be ignored [9,4,3].
The criteria for sample selection are becoming more sophisticated. Certain criteria, such as those based on signal-to-noise, color, and data quality, are simply expressed as a function of the observed data. Machine learning classifiers are being developed to select likely SNe Ia [10], but their inner workings can be opaque and not easily modeled as a simple function of data. Determining whether a supernova enters the sample requires nothing short of running its data through the black box classifier. Without an analytic expression for the sample selection, the volume of classifications required to characterize the classifier to achieve a targeted precision in parameter inference could be computationally non-trivial.
An added complication in modern surveys is that sample selection may not entirely defined by the cosmology analysis team. For example, the transients discovered by the Vera C. Rubin Legacy Survey of Space and Time (LSST) are expected to undergo a series of pipelines, each the separate responsibility of either the Project, Broker teams, or Science Collaborations. The LSST Project plans to identify transients in cadenced imaging surveys, triggering alerts that will be distributed to a small number of third-party brokers who will distribute filtered subsets to the community in near-real-time. In addition, all alerts and objects will be available within 24 hours of shutter close through the Project's Prompt Data Products databases. 1 Science teams that use these products to define their samples may find that they need to understand the sample-selection efficiencies of antecedent processes that are the responsibility of other parties.
This article presents a procedure to determine errors in parameter inference due to numerical errors that naturally arise when taking into account sample selection. Up until now, the sample selection in SN Ia cosmology analysis could be effectively characterized by the properties of the measurement from a single observation or visit (e.g., signal-to-noise, percent flux increase, magnitude) or be otherwise due to processes independent of the source properties (e.g., mis-subtractions, lack of spectroscopy follow-up time). Now that new surveys discover large numbers of non-SN Ia transients, the sophisticated classifiers now included as part of analysis pipelines complicate sample-selection characterization. The motivation of this work is to determine how often classifiers have to be queried in order to achieve precise distance inference. Though we are motivated by LSST, the procedure is of general applicability. In §2 we present a generic model, its likelihood, and the best-fit parameters and uncertainties calculated from the likelihood, for a survey with sample selection. Monte Carlo integration is introduced to calculate the numerical integrals involved. For toy examples, in §3 we propagate numerical-integration errors into those of the best-fit parameters and parameter uncertainties, as a function of the number of Monte Carlo samples. A discussion on those results and their implications on computing resources are given in §4. Conclusions are given in §5.

2.
Distances from a contaminated sample of standard candles

The model
This article considers a simple model that captures the main complications of future SN cosmology analyses. We consider standard candles each with a measured distance estimate, without reference to the processes of getting that estimate from observables. All the objects have the same distance, which is the limit of considering a subset of SNe that lie within a narrow redshift bin. There are two populations of objects, one brighter candle with tight intrinsic magnitude dispersion and a second fainter "candle" with a broad magnitude dispersion. The observed sample of objects is magnitude limited, replicating a detection threshold in their discovery. The discovered sample undergoes further classification filtering, using supplemental data and potentially the distance measurement, in an effort to use only objects from the bright population. The classifier may not be perfect, filtering out some of the desired population while leaking in some of the undesired fainter population. The model has two toplevel parameters: the distance of the objects and the underlying ratio in the numbers of the two populations.
Here we summarize the assumptions, parameters, parameter dependencies, and notation for the model.
• All objects are at the same distance denoted by their distance modulus µ.
• There are two populations labeled as T = 0, 1. The T = 0 population is of primary interest, being bright and precise distance indicators (e.g., SNe Ia) while the T = 1 population represents the background (e.g., corecollapse supernovae). The underlying fraction of type T = 0 is given by • All objects have a magnitude observable m. The magnitude datum from a candle of population T is drawn from a Normal distribution where µ is the distance modulus, M T the candle's absolute magnitude, and σ T the standard deviation, which is due to a combination of intrinsic dispersion and measurement noise. Objects may also have additional data x that are not sensitive to µ but are used in object detection and classification, e.g., colors, light-curve shape, spectra, host-galaxy properties.
The above model for perfect standard candles does not apply to SNe Ia, which are considered to be a family of similar objects where the absolute magnitude M T of each individual SN is a parameter that can be inferred from data. The intrinsic magnitude dispersion of SNe is not necessarily Gaussian. In addition, foreground effects e.g., dust extinction and, gravitational lensing, also affect observed magnitudes.
• The data from different objects are independent.
• Candles entering the sample must exceed a brightness threshold m lim . Each candle has a parameter S = 1, 0 indicating whether it does or does not meet the brightness criterion. P (S|m) = δ SS(m) where S(m) = H(m lim − m) and H is the Heaviside function. In terms of LSST, this is akin to saying that the Project will efficiently generate magnitude-limited transient alerts.
• Candles entering the sample are typically classified as having a high probability of being T = 0. A classifier ingests data and outputs results used to decide whether to include an object in the final sample. The classification decision is given by τ (m, x) = 1, 0 for an object included/excluded from the sample. P (τ |m, x) = δ τ τ (m,x) . For LSST the classification process may be composed of several distinct components including live Public Broker assessments, real-time observing decisions, and a posteriori classifications based on the full light curve.
The performance of the classifier is described by its efficiency 0 that a candle of T = 0 has τ = 1, and its false-positive probability 1 that a candle of T = 1 has τ = 1. The probability of object of type T being classified as τ is P (τ |T ) = ( 0 δ 0T In place of a real classifier, in the examples presented in this article the classifier efficiency and false-positive probability are taken to be random, hence completely uncorrelated with m. In this case the classifier itself does not induce a bias in distance modulus. Most classifiers, however, are expected to use magnitude information and the presented methodology corrects for any potential bias.

Parameter estimators and uncertainties
In this work, the likelihood and its calculation are split into two more basic building blocks. The first is the expected fraction of all objects that pass both detection and classification criteria to make it into the analysis samplē As will be shown shortly, it is the calculation of this term and its partial derivatives with respect to the parameters that can require a large number of evaluations of the sample selection function to consider all objects that could have entered the sample. The second is the probability of an object with magnitude m making it into the analysis sample.
This term requires calculating the sample selection probability of only those objects that are in the sample. The likelihood is the probability given by the model for the sample-selected data, which is predicated on the detection and classification selection criteria being satisfied. Recalling the independence of the data across objects where the index i runs over the N objects in the final sample. All objects contribute a common term for the probability of an arbitrary object making into the sampleS and an individual term of its own probability of discovery R.
The Hessian of the likelihood surface is Its value at the extremum provides an estimate of the parameter uncertainty.
In this article, the statistical uncertainty in µ is given by σ µ ≈ H −1 µµ . The estimators and their uncertainties are subject to error when there are uncertainties in the logarithmic derivatives ofS. The estimator depends on the average value of ∂ ln R/∂θ, but is otherwise independent of N . The Hessian similarly scales with N . The errors in the parameter estimators and the fractional errors in the parameter uncertainties presented in this article are thus independent of the sample size.
Given the model parameters and their probabilities, the functionsS, R, and their partials can be expressed including the latent variables T and m as and The second derivatives with respect to p 0 are zero. Although not denoted explicitly, the supplement data x used only for detection and classification enter S and R implicitly through P (τ |m, x).

Monte Carlo integration and its errors
S and its derivatives involve an integral whose integrands depend on the classifier through τ (m, x) and the detection selection S(m, x). In general these functions are challenging to express analytically with precision. Even if the integrands were analytic, the integrals themselves are generally non-analytic. The integrals are solved numerically through Monte Carlo integration, with the integrand evaluated at randomly drawn sets of {m, x} values. The precision of the integration is dependent on the number of times the integrand is sampled. Therein lies the technical motivation of this work, the requirements on the number of evaluations of S(m, x) and τ (m, x) necessary to achieve precise distance inference.
In our model, the detection threshold S(m) is simply described by a step function while the classifier is treated as a black box for which τ (m, x) can't be written analytically. Efficient integration focuses on sampling magnitudes that already satisfy the detection criterion without reference to the classifier, drawn from This probability distribution function (pdf) enters the integrand ofS (and its derivatives) as Monte Carlo integration draws N MC realizations from a fiducial distribution with fixed parameters µ 0 , p 00 (in the calculations that follow, we optimistically set µ 0 and p 00 to their input values), Then the Monte Carlo integration ofS is The partial derivatives ofS are calculated similarly.
The calculated values ofS are used to deduce the estimator and uncertainty of µ. An independent calculation ofS, using different realizations for the Monte Carlo integration, produces different values ofμ and σ µ . The standard deviations of the distributions ofμ and σ µ from many sets of Monte Carlo realizations represent the error of interest in this article.

Pure sample
Before continuing with the general model, we take a short detour to consider the special case of a Pure sample, where there are no false positives ( 1 = 0) and all objects are of type T = 0. The classifier nevertheless may not be efficient, 0 = 1, so that the effect of the classifier must be accounted for. This example corresponds to an analysis of a spectroscopically pure sample. This case is useful to review, as the estimator and uncertainty simplify into expressions that may be more easily recognized by the non-specialist.
There are a few minor changes to the equations shown in the previous section.
The change from the general case is that T is no longer a latent parameter, being replaced by T = 0 and the summation over the two types is removed.
• Efficient Monte Carlo integration draws from p(m|S = 1, T = 0, µ, p 0 ), in terms of which the integrand ofS is • The pure sample gives no information on the relative rates, so that partial derivative of the likelihood with respect to p 0 is zero.
• The partial derivatives of R simplify to resulting in an estimator forμ that solves and uncertainty estimated from evaluated atμ. If the sampler were complete, then ∂S/∂µ = 0 and the estimator in Eq. 19 would simplify to being the weighted mean of the measurements with statistical uncertainty σ 0 / √ N from Eq. 20.
The estimatorμ and its statistical uncertainty σ µ are calculated from Eqs. 19 and 20 respectively for many instantiations of Monte Carlo draws. The standard deviations in these calculated values represent their errors due to MC integration.
To illustrate, we take an example setting the absolute magnitude to M 0 = 0, the magnitude dispersion to σ 0 = 0.1 mag, the classifier efficiency to 0 = 0.95, and no false positives 1 = 0. The input model parameters for the distance modulus and Population-0 fraction are µ = 0 and p 0 = 0.5 respectively.
Bothμ and σ µ depend on the sample data through R. In place of instantiating a simulated data set for these calculations, we adopt a "typical" dataset that satisfies where "cdf" is the cumulative distribution function of a Normal distribution. The errors due to Monte Carlo integration in the calculation ofS and its partials are shown in Figure 1, which plots the 16, 50, and 84%-iles of the calculated valuesS, ∂S/∂µ, and ∂ lnS/∂µ for m lim = −0.1 mag and independent draws of N MC =10,000. Several instantiations of these functions are overplotted, showing that the errors in the function values at different µ are correlated. The curves are linear given the choice of a classifier that is uncorrelated with magnitude; classifier output that are correlated with magnitude would imprint µ-dependent structure in these curves.
The estimatorμ is the value of µ where the term on the left of Eq. 19 is equal to the data-dependent, µ-independent term on the right. Figure 2 plots the left-hand term for a number of instantiations of the MC integration, and the the constant values of the right-hand terms as horizontal lines. The estimators, where the sloped curves intersect the horizontal line, are different for each MC realization, and the estimator errors are given by the standard deviations of these intercepts.
We call attention to several interesting features seen in Figure 2. The intercepts are centered around the input µ = 0, confirming that the ∂ lnS/∂µ term in Eq. 19 properly accounts for the Malmquist bias in the magnitude-limited sample. Within the range of the plot, the sloped curves are nearly parallel, meaning that the Monte Carlo errors are not so sensitive to the data, i.e. a different value of the right-hand term and location of the horizontal line. The   slope of the left-hand term varies with the magnitude limit m lim , from 1 when ∂S/∂µ = 0, decreasing to 0 as the threshold gets more severe. Shallower slopes are responsible for larger errors in the intersection of the horizontal line and hence inμ. We again emphasize that lack of structure in the left-hand term is a feature of the classifier considered. A classifier that has ∂ lnS/∂µ + µ that is not monotonic in µ can have a likelihood with multiple local maxima whose characterization requires more detailed analysis than used here. While the same trend is true for the error in the uncertainty s.d.(σ µ ), s.d.(σ µ )/σ µ does not monotonically decrease with increasing limiting magnitude because σ µ decreases even faster with limiting magnitude.
For a survey with a target precision in µ, the required number of Monte Carlo samples N MC can be read from Figure 3. For example, the uncertainty is < 0.01 mag for N MC = 1, 000 with limiting magnitude as bright as m lim = −0.2 mag. Correct calculation of parameter uncertainties is as important as the estimators themselves. A 1% error fractional uncertainty is obtained with N MC =10,000.

General example
We now return to the general case of a Two-Population model with sample contamination. The true type of each object is uncertain, so the latent parameter T comes into play. The data can longer be isolated into a parameterindependent term as was the case for Pure sample. The relative rate of the two populations p 0 must now be considered. This example corresponds to an anal-ysis that would be undertaken when the sample is selected using an imperfect photometric classifier.
The example considered here is similar to that of the Pure sample. Population 0 has absolute magnitude M 0 = 0 and magnitude dispersion of σ 0 = 0.1 mag. The contaminating population has M 1 = 1 and magnitude dispersion of σ 1 = 0.5 mag. The classifier efficiency remains 0 = 0.95, but now the outlier population can pass the classification criteria with probability 1 = 0.05. The fiducial model parameters are µ = 0 and p 0 = 0.5.
Results are presented for one realization of the data. They yield a best-fit values that are not precisely the fiducial parameter values but are within statistical uncertainties. The sample data are selected from 10,000 objects from both populations that satisfy the magnitude-detection threshold, which are subsequently filtered by the classifier.
While the distance modulus µ is the focus of this article, the populationratio parameter p 0 has to be accounted for. The estimatorp 0 solves Eq. 7, which is calculated for many different instantiations of Monte Carlo integrals with N MC =10,000 and plotted in Figure 4. The average solution is centered near the fiducial p 0 . The dispersion in thep 0 is relatively large; in fact for N MC =1,000, a significant fraction of the calculated partial derivatives of the log-likelihood do not have a root within the allowed range [0, 1]. The error inp 0 is smaller for m lim = −0.1 than it is for 0.1, a result of having a larger fraction of Population 1 objects that pass the magnitude cut (the broad bright tail of Population 1 contributes relatively more than the narrow tail of Population 0), which leak into the classified sample. The function that solves theμ estimator looks similar to that of the Pure example and so is not shown.
The estimatorsμ andp 0 are calculated for many instantiations of the Monte Carlo integrals and their distribution is shown in Figure 5. The contour is not centered at the input value due to the statistical noise inherent in the data realization for which the calculation is performed. The fit parameters are correlated, as seen in the rotated shape of the ellipses.
The errors in the estimators and their uncertainties, s.d.(μ) and s.d.(σ µ )/σ µ , are calculated for a range of scenarios as a function of N MC and presented in Figure 6. Qualitatively the results are similar to those for the Pure sample in §3.1. They span similar ranges of uncertainty, and have the same dependency on N MC . Both examples share the same sequence of m lim when ordering error sizes, though they do differ somewhat in their quantitative details.
In this case with two populations with different magnitude distributions and different probabilities of being classified as Population 1, the fraction of Population 1 and 2 objects entering the sample changes as a function of m lim .
Comparison of the Two-Population scenario of this section and the pure sample in §3.1 shows that the former requires an order-of-magnitude larger N MC in order to achieve the same precisions. The m lim -dependence of the variance of the MC sampling distribution p(m|S = 1, µ, p 0 ) is quite different than before, decreasing as the magnitude limit is turned on, reaching a minimum, and then increasing with increasing brightness cutoff when the sample predominantly consists of the bright tail of the second population. At m lim = 0 mag, the variance   of the two-population distribution is 1.4× larger than that of the pure distribution. Given that a significant fraction of the faint population does not pass sample selection, theS is ≈ 0.475 that of the pure distribution. Together these factors would account for a 6.2× larger N MC for the two-population model to match the parameter precision of a pure model. Of course a complete accounting of the comparison of the N MC of the two models depends on the integrands that appear in the partial derivatives ofS.

Discussion
Model inference and parameter constraints require the evaluation of an integral, which generally is not analytic and so is solved numerically. The integrand contains the terms S(m) and τ (m, x), which represent sample and classification selections respectively. At the time of analysis (say after the conclusion of the LSST survey), these functions whose properties will have been set long previous (say Year 1 of LSST), need to be evaluated many times. Understanding the computational implications for this aspect of the analysis and consequently the requirements for Brokers was the prime motivation of this work.
Conservatively speaking, both software and data states that influence sample and classification selections non-trivially, be they from the Project, third-party Brokers, or in-house pipelines, should be available for execution ∼ 10 years after real-time usage to feed cosmology analysis. We thus advocate the packaging of software in containers and the tracking data state histories, so that the processing at any given moment can be rerun at a later date on a different computer.
In the examples fitting for a single µ considered in §3, the number of function evaluations needed to get per-mil precisions is on the order of millions. Expanding the analysis by splitting the sample into ten independent redshift bins gets the number of evaluations up to the ten million real alerts per night expected from LSST. Nevertheless, the rich dataset from LSST can lead to more ambitious analysis and complex models to accommodate considerations such as photometric redshifts, non-Gaussian magnitude distributions of backgrounds that mimic SNe Ia, SN Ia subtypes.

Conclusions
The likelihood of an experiment that has sample selection includes an integral over all possible objects that could have but did not enter the sample. Monte Carlo integration is a viable way to numerically evaluate that integral. Being in the integrand, the sample-selection and classification efficiencies are represented by their values at the random points generated by the Monte Carlo. The precision of the integration depends on the number of points at which the integrand, and hence the efficiency, is sampled. Integration errors propagate into errors in the parameter estimators and the fractional errors in the parameter uncertainties; these uncertainties are independent of the number of objects in the analysis sample. Requirements of parameter estimation thus set the minimum number of points at which the efficiency must be evaluated.
For several example scenarios for measuring distances with SNe Ia, the sample selection efficiency must be evaluated for a number of simulated supernovae that far exceeds that of the sample itself. Otherwise the uncertainty in the distance estimatorμ due to the miscalculated likelihood can be a significant fraction of the overall µ uncertainty.
The number of Monte Carlo evaluations needed to meet precision requirements is sensitive to the model. The numerical evaluation of the integral needed to calculate the optimal estimator (e.g., Eq. 7) has variance ∝ σ 2 N N , where σ 2 N is the variance of the drawn values of the integrand. When the underlying model includes multiple source populations, the pdf of the observables p y broadens, the pdf of the draws of the integrand broadens, and the variance ofS gets larger, leading to an increase in the error of µ found in the root solution.
A survey's statistical performance improves when an informative classifier is used; in this way classifiers indirectly affect the required number of evaluations of the selection function. Characterizing the classifier itself requires simulated SNe, though in the formalism we present here their sample selection need not be assessed. Determining the required precision on classifier performance is beyond the scope of this study.
Further Work: More work is necessary to produce sample-selection requirements for a realistic LSST SN analysis. The toy models presented in this note are extreme simplifications of the one that will eventually be used to describe the LSST sample. Here we treated SNe Ia as standard candles but they are in fact standardizable candles with their own internal subparametrization. While we used a classification selection that is m-independent, the first and second moments of the sample and underlying population magnitudes will probably differ and induce biased sample magnitudes. The sample supernovae were taken to have the same observables whereas it is expected that there will be a subset of spectroscopically typed objects (with its own sample selection function) analyzed together with the photometric set. The magnitude distribution of supernovae was taken to be Normal, whereas the parameter-distributions of SNe Ia are not Normal and indeed must be modeled in the fit. The observed SN Ia magnitude dispersion (before color and light-curve shape corrections) relevant for a magnitude-limited survey is ∼ 0.4 mag, larger than the 0.1 mag considered here. An implication is that SN Ia subpopulations must also be included in the model. The true sample selection will be date-dependent, depending on the conditions of actual observing, and may be stochastic (i.e. 0 < S(m) < 1). Multiple sample selections may enter a single analysis, for example a training set of follow-up supernovae can be identified using only early data whereas the larger sample of supernovae would use data covering the full light-curve evolution. Here we considered the distance µ as the parameter of interest, whereas many experiments are evaluated based on the physically relevant dark energy equation of state parameters w 0 and w a . The more diverse the model for the underlying supernova populations, the more possible realizations there are to integrate over, which should increase the Monte Carlo sampling requirements to maintain precision in the likelihood. As the community continues to develop future analyses, the implications for characterizing the sample selection should be kept in mind.