Marginalised Normal Regression: Unbiased curve fitting in the presence of x-errors

The history of the seemingly simple problem of straight line fitting in the presence of both $x$ and $y$ errors has been fraught with misadventure, with statistically ad hoc and poorly tested methods abounding in the literature. The problem stems from the emergence of latent variables describing the "true" values of the independent variables, the priors on which have a significant impact on the regression result. By analytic calculation of maximum a posteriori values and biases, and comprehensive numerical mock tests, we assess the quality of possible priors. In the presence of intrinsic scatter, the only prior that we find to give reliably unbiased results in general is a mixture of one or more Gaussians with means and variances determined as part of the inference. We find that a single Gaussian is typically sufficient and dub this model Marginalised Normal Regression (MNR). We illustrate the necessity for MNR by comparing it to alternative methods on an important linear relation in cosmology, and extend it to nonlinear regression and an arbitrary covariance matrix linking $x$ and $y$. We publicly release a Python/Jax implementation of MNR and its Gaussian mixture model extension that is coupled to Hamiltonian Monte Carlo for efficient sampling, which we call ROXY (Regression and Optimisation with X and Y errors).


INTRODUCTION
The problem of determining the parameters of a functional fit to a data set is perhaps the most basic in science. Assuming no prior knowledge of the parameters, this is achieved by maximising the likelihood of the data given the fit and the error model, and deriving confidence intervals on the parameters from the isocontours of the log-likelihood. With uncorrelated Gaussian uncertainties this is trivial when only the dependent variable ("y") is uncertain: up to a constant, the log-likelihood is the negative sum over points of the squared residuals from the functional expectation divided by twice the squared uncertainty.
Unfortunately, the independent variable of the regression ("x") is itself often uncertain. Perhaps surprisingly, this greatly increases the complexity of the inference regardless of the functional form being fit. This has led to much confusion in the literature, with researchers often seeming to despair of there being a correct solution and instead running a battery of ad hoc algorithms and taking the spread in their results as a measure of systematic error (Isobe et al. 1990a; Kelly 2012). For example, focusing on the case of linear regression, Boggs & Rogers (1990) suggest using the orthogonal distance regression (ODR) method, whereby one minimises the (weighted) sum of squares of the orthogonal distances between the observed data points and the line. A popular approach to the asymmetry between the dependent and independent variables has been the so-called "forward" and "backward" methods, whereby one performs linear fits of y vs x and then x vs y, taking into account the uncertainty on the dependent variable only in each case. The results are then combined in somewhat ad hoc ways (see, e.g. Isobe et al. 1990b; Feigelson & Babu 1992), including the "bisector" method where one chooses the parameters of the straight line that bisects these two fits. Another method, dubbed BCES (Akritas & Bershady 1996), is a weighted least-squares estimator that extends the ordinary least-squares (OLS) method. This has been criticised by Tremaine et al. (2002) because it is independent of the errors on the dependent variable, suffers from the potentially large impact of a few low-error measurements, performs poorly on mock data and has a potentially numerically unstable denominator.
An important consideration is the way in which the regression method treats intrinsic scatter in the fitted relation. This describes variability in y at fixed x (or x at fixed y) beyond that accountable for by the uncertainties in the measurements, and may be due to correlation of x and/or y with one or more additional variables. Intrinsic scatter commonly arises when projecting the parameter spaces of physical systems into two dimensions, and cannot be reduced by performing more measurements. To allow for intrinsic scatter, Tremaine et al. (2002) advocate the minimisation of a χ² function in which the squared residuals of the function are weighted by the sum in quadrature of the y errors, a variable characterising the intrinsic scatter, and the product of the gradient of the line and the x errors (sometimes called the "Nukers" method; see also Novak et al. 2006). This adjusted χ² fitting procedure was incorporated as an extension to the mpfit package (Markwardt 2009) by Williams et al. (2010), where the intrinsic scatter is adjusted to give a reduced χ² of unity.
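For concreteness, this adjusted χ² can be written down in a few lines. The following is a sketch under our own naming conventions (it is not the mpfit extension itself), for a straight line y = A x + B:

```python
import numpy as np

def nukers_chi2(A, B, sig_int, x, y, sig_x, sig_y):
    """Adjusted chi^2 of Tremaine et al. (2002) for the line y = A*x + B.

    Each squared residual is weighted by the quadrature sum of the y error,
    the intrinsic scatter and the gradient times the x error."""
    var = sig_y**2 + sig_int**2 + (A * sig_x)**2
    return np.sum((y - A * x - B)**2 / var)
```

In the scheme of Williams et al. (2010), sig_int would then be adjusted until the reduced χ² equals unity.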
In cases where the methods described above have been tested against mock data, they have been found to be biased (e.g. Aguado-Barahona et al. 2019, 2022). In addition, some of them assume homoscedasticity (constant errors among the data points), no intrinsic scatter (i.e. perfect correlation between the underlying x and y values) or a linear underlying relation, or provide point estimates of the regression parameters without confidence intervals. As these methods are statistically unjustified we do not investigate them here. For in-depth treatises on measurement error models and their impact on regression see Fuller (1987); Carroll et al. (2006); Buonaccorsi (2010).
From a Bayesian perspective the setup of the problem is quite simple (see Section 2), and sees the emergence of new degrees of freedom in the hierarchical model describing the "true" x values, x_t, from which the "observed" values, x_o, are derived by scattering according to the error model. The distribution of x_t values underlies the inference but can only be inferred through its effect on the observed x_o and y_o values. The likelihood of the data given x_t and the fit parameters is simple to write down (Eq. 2). When the intrinsic scatter σ_int in the relation is assumed to be 0 (i.e. the true model y value, y_t, is deterministically related to x_t), the maximum of this likelihood is at the true parameter values. This means that the profile likelihood (equivalent to marginalising over x_t given delta-function priors at their maximum-likelihood values) is unbiased. The neglect of the uncertainties of x_t can however lead the profile likelihood to underestimate the uncertainties on the fit parameters. This can be overcome by marginalising over x_t with a Jeffreys prior (Jeffreys 1946), which is reparametrisation invariant and hence removes volume effects. By contrast, marginalising over x_t with a uniform prior does introduce volume effects such that the maximum of the marginal likelihood is not at the maximum of Eq. 2, biasing the inference. This parallels investigations into methods of marginalising over other types of nuisance parameter, as explored for example in Berger et al. (1999); Hadzhiyska et al. (2023).
The problem arises when the intrinsic scatter σ_int between x_t and y_t is assumed not to be 0, putting a free parameter in the denominator of the exponent of the likelihood. In this case, we will show that the maximum of the likelihood is not at the true parameter values, so that both the profile likelihood and the one obtained by marginalisation over x_t with a Jeffreys prior are biased. This is not a small effect: the maximum-likelihood values of σ_int² are negative until the true σ_int² exceeds several times the other variance scales in the problem. Marginalising over x_t with a uniform prior produces a marginal likelihood with a much less biased σ_int value at its maximum, but biases in both σ_int and the parameters describing the shape of the function being fit remain. A significant part of the paper is devoted to investigating and quantifying these effects.
How then to achieve an unbiased inference in the presence of intrinsic scatter? One way to think about the problem is to realise that if the parameters of the functional fit are to be inferred correctly, so too must be the x_t values. The distribution of these is however not known, and typically none of the marginalisation methods considered above has sufficient flexibility to capture that distribution a posteriori. The problem then is how to create a prior x_t distribution that is sufficiently flexible to capture the true distribution but at the same time is computationally tractable given the limited data available. This is studied in Gull (1989); Mueller et al. (1996); Carroll et al. (1999); Schafer (2001); Huang et al. (2006); Kelly (2007); Carroll et al. (2009); Sarkar et al. (2014), where effective solutions finally emerge. There are two classes of approach, so-called parametric and nonparametric (or semi-parametric) models. In the former one proposes a parametrised form for the prior on the x_t distribution, and those parameters are marginalised over in constraining the parameters of interest. The latter attempts to overcome the rigidity of fixed functional forms by adopting a fully general x_t distribution, which may be achieved by histogramming or kernel density estimation, or by modern machine learning methods such as neural networks or Gaussian processes. The parametric approach offers stability and interpretability at the cost of potential model misspecification, while the nonparametric approach offers robustness to bias at the cost of potential intractability of the inference if the size of the dataset is insufficient.
A particularly convenient class of parametric model is the Gaussian mixture model (GMM), which we will show to offer a general-purpose solution to the inference problem across the range of potential scientific applications. Indeed, we will show that in almost all cases a single Gaussian is sufficient to yield unbiased inference, even in cases where the true x_t distribution is far from Gaussian. We show analytically that the bias of this method is exactly zero in the limit of infinite sample size and constant x and y errors, irrespective of the underlying x_t distribution, and through an exhaustive set of numerical experiments that the bias very rarely exceeds 2σ. We dub this method "Marginalised Normal Regression" (mnr), and implement it to fit nonlinear functions using non-diagonal covariance matrices. We publicly release an efficient Python/Jax (Bradbury et al. 2018) implementation of our method powered by Hamiltonian Monte Carlo, which we call roxy (Regression and Optimisation with X and Y errors). This also offers the capability of fitting more than one Gaussian in cases where that alleviates the residual bias, as we discuss further below.
Our GMM algorithm is similar to the method of Kelly (2007), which has IDL and Python implementations. This method has also been implemented in the R language with an extension to multiple variables (Linear Regression by Gibbs Sampling, lrgs; Mantz 2016) and with time-dependent parameters (LInear Regression in Astronomy, lira; Sereno 2016). The primary advantages of roxy over these are efficiency (leveraging automatic differentiation and Hamiltonian Monte Carlo), flexibility (offering multiple sets of hyperpriors, the ability to automatically determine the optimum number of Gaussians, and the ability to fit nonlinear functions) and ease of use, as well as small algorithmic upgrades such as an ordering of the means of multiple Gaussians to remove the degeneracies between them. Further comparison with the method of Kelly may be found in Sec. 4.
This paper is structured as follows. To begin, we focus for simplicity on linear regression. In Section 2 we describe the setup of the problem and derive the likelihood of the data as a function of the straight-line parameters and x_t, as well as the marginal likelihood when integrating out x_t under various priors. We analytically calculate the properties of these marginal likelihoods under necessary simplifying assumptions (most importantly homoscedasticity), referring to Section A for detailed derivations. In Section 3 we generate mock datasets with characteristics spanning the range likely to be encountered in real-world applications and fit them with the various likelihoods to quantify the biases they induce. This will show that mnr is effectively never biased, while the other models almost always are. We also provide a comparison with the case of multiple Gaussians, demonstrating that one Gaussian is almost always sufficient for unbiased regression and that adding further Gaussians does not generically reduce bias. In Section 4 we apply the models to a real-world test case in cosmology to illustrate the kinds of differences they produce. In Section 5 we extend the models to nonlinear functions and an arbitrary covariance matrix, and in Section 6 we present our implementation in the public roxy program. Section 7 concludes.

THE PROBLEM AND ITS POSSIBLE SOLUTIONS

In this section we set up the seemingly simple problem of fitting a straight line to data, under the assumption of uncorrelated Gaussian errors on the x and y values. We present three methods for determining the parameters of the function: the slope, A, the intercept, B, and the intrinsic scatter in the y direction, σ_int. We generalise this problem to an arbitrary function and covariance matrix in Section 5.

Setup
We will consider the problem of fitting a straight line to a set of true values (x_t, y_t), assuming that these are related as

$$ y_{\rm t}^{(i)} = A\,x_{\rm t}^{(i)} + B. \tag{1} $$

We will denote by θ the vector of parameters that we wish to infer. This will include A and B, but potentially other parameters too, as outlined in the rest of this section. Of course, we do not observe the true values, but the set of observed values x_o and y_o, which we assume are Gaussian distributed about the true values, with uncertainties σ_x^(i) and σ_y^(i) in the x and y directions for point i, respectively. We can then write the likelihood for some observed x_o and y_o given A, B and x_t as

$$ \mathcal{L}(x_{\rm o}, y_{\rm o} \,|\, x_{\rm t}, \theta) = \prod_i \frac{1}{2\pi\, \sigma_x^{(i)} \sqrt{\sigma_y^{(i)2} + \sigma_{\rm int}^2}} \exp\left[ -\frac{\left(x_{\rm o}^{(i)} - x_{\rm t}^{(i)}\right)^2}{2\sigma_x^{(i)2}} - \frac{\left(y_{\rm o}^{(i)} - A x_{\rm t}^{(i)} - B\right)^2}{2\left(\sigma_y^{(i)2} + \sigma_{\rm int}^2\right)} \right], \tag{2} $$

where we have included an additional intrinsic scatter, σ_int, which is not captured by the measurement uncertainties and is another parameter we wish to infer. In a Bayesian context, we wish to use the likelihood to determine the posterior of x_t and θ given x_o and y_o,

$$ P(x_{\rm t}, \theta \,|\, x_{\rm o}, y_{\rm o}) = \frac{\mathcal{L}(x_{\rm o}, y_{\rm o} \,|\, x_{\rm t}, \theta)\; p(x_{\rm t}, \theta)}{Z}, $$

where p(x_t, θ) is the prior probability of x_t and θ, and Z is the evidence. Given p(x_t, θ), one can obtain a numerical approximation to the posterior through Markov chain Monte Carlo (MCMC). Note that, in general, one must sample both x_t and θ since the likelihood is a function of both, although in certain cases one can analytically marginalise over x_t and thus sample only θ.
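As a concrete sketch, the log of Eq. 2 can be coded directly for the straight-line case. The function name and argument ordering here are our own illustrative choices, not the roxy API:

```python
import numpy as np

def log_like(theta, xt, xo, yo, sig_x, sig_y):
    """Log of Eq. 2: xo scatters about the latent xt with error sig_x, and
    yo scatters about A*xt + B with error sqrt(sig_y^2 + sig_int^2)."""
    A, B, sig_int = theta
    s2 = sig_y**2 + sig_int**2      # total variance in the y direction
    yt = A * xt + B                 # deterministic true y values (Eq. 1)
    return -0.5 * np.sum(
        (xo - xt)**2 / sig_x**2 + np.log(2 * np.pi * sig_x**2)
        + (yo - yt)**2 / s2 + np.log(2 * np.pi * s2)
    )
```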
We now arrive at the crucial question: what prior to impose on x_t? In the following subsections we consider an (improper) infinite uniform prior, a delta-function at the x_t values that maximise Eq. 2 (giving a profile likelihood over the x_t, which we find to give very similar results to a Jeffreys prior), and a Gaussian hyperprior with mean and width to be inferred.

Infinite uniform marginalisation
A common choice (D'Agostini 2005) is to give x_t (as well as A, B and σ_int) infinite uniform priors, such that p(x_t, θ) is independent of x_t. The motivation for this choice is that it is seemingly uninformative and maximally agnostic about the x_t values, and hence might be thought least likely to introduce a bias if a more restrictive prior on x_t is inaccurate. We will see that this is incorrect.
In this case, we can trivially marginalise over x_t to obtain

$$ \mathcal{L}_{\rm unif}(x_{\rm o}, y_{\rm o} \,|\, \theta) = \prod_i \frac{1}{\sqrt{2\pi \left(A^2 \sigma_x^{(i)2} + s_{\rm int}^{(i)2}\right)}} \exp\left[ -\frac{\left(y_{\rm o}^{(i)} - A x_{\rm o}^{(i)} - B\right)^2}{2\left(A^2 \sigma_x^{(i)2} + s_{\rm int}^{(i)2}\right)} \right], $$

where we have introduced $s_{\rm int}^{(i)2} \equiv \sigma_y^{(i)2} + \sigma_{\rm int}^2$ for notational convenience. We denote this likelihood unif.
Referring to Section A.1.1 for the full derivation, we find the maximum a posteriori parameter values in the case of universal errors on x and y, i.e. σ_x^(i) = σ_x and σ_y^(i) = σ_y for all i, to be

$$ \hat{A} = \frac{{\rm Cov}(x_{\rm o}, y_{\rm o})}{{\rm Var}(x_{\rm o})}, \qquad \hat{B} = \bar{y}_{\rm o} - \hat{A}\,\bar{x}_{\rm o}, \qquad \hat{\sigma}_{\rm int}^2 = {\rm Var}(y_{\rm o}) - \hat{A}^2\left({\rm Var}(x_{\rm o}) + \sigma_x^2\right) - \sigma_y^2, $$

where the notation is defined at the start of Section A. In Section A.1.1 we then compute the difference between these values and the true values in the limit of an infinite sample size and find that these estimates are biased by an amount

$$ \delta A = -\frac{\tilde{A}\,\sigma_x^2}{{\rm Var}(\tilde{x}_{\rm t}) + \sigma_x^2}, \qquad \delta B = \frac{\tilde{A}\,\sigma_x^2\,\langle\tilde{x}_{\rm t}\rangle}{{\rm Var}(\tilde{x}_{\rm t}) + \sigma_x^2}, \qquad \delta\sigma_{\rm int}^2 = \frac{\tilde{A}^2\,\sigma_x^4\,{\rm Var}(\tilde{x}_{\rm t})}{\left({\rm Var}(\tilde{x}_{\rm t}) + \sigma_x^2\right)^2}, $$

where a tilde denotes the true values. Thus, for positive Ã, we would underestimate the gradient, and for all non-zero values of Ã we overestimate the intrinsic scatter. The sign of the bias on B is equal to the sign of the product of Ã and the mean of the underlying true x distribution. This clearly demonstrates that the unif method is unreliable. We note that variants of this method have been proposed, e.g. finite uniform priors (Hee et al. 2016), but it is beyond the scope of our work to investigate them in detail.
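The attenuation of the slope encoded in δA (classical regression dilution) is straightforward to verify numerically. The following simulation uses arbitrary illustrative values, not one of the mock suites of Section 3:

```python
import numpy as np

# With constant errors, the unif maximum-likelihood slope is
# Cov(xo, yo) / Var(xo), which is attenuated towards zero relative to the
# true slope by the x-error variance (regression dilution).
rng = np.random.default_rng(0)
A_true, B_true, sig_x, sig_y = 2.0, 1.0, 2.0, 1.0
N = 200_000

xt = rng.uniform(0.0, 10.0, N)                         # true x values
xo = xt + rng.normal(0.0, sig_x, N)                    # observed x
yo = A_true * xt + B_true + rng.normal(0.0, sig_y, N)  # observed y

A_hat = np.cov(xo, yo)[0, 1] / np.var(xo, ddof=1)
A_expect = A_true * np.var(xt) / (np.var(xt) + sig_x**2)
print(A_hat, A_expect)  # both attenuated below A_true
```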

Profile likelihood

An alternative is to fix x_t to the values that maximise the likelihood Eq. 2 as a function of A, B and σ_int. The attraction of this approach is that the maximum of this profile likelihood is by construction at the same point in parameter space as the maximum of the full likelihood Eq. 2, which might be hoped to be unbiased (we will show that in general it is not). This is in contrast to any of the other methods we discuss, where marginalising out x_t introduces a volume effect that shifts the maximum-posterior point away from the maximum-likelihood point.
Maximising Eq. 2 with respect to each of the x_t^(i) gives

$$ \hat{x}_{\rm t}^{(i)} = \frac{x_{\rm o}^{(i)}/\sigma_x^{(i)2} + A\left(y_{\rm o}^{(i)} - B\right)/\left(\sigma_y^{(i)2} + \sigma_{\rm int}^2\right)}{1/\sigma_x^{(i)2} + A^2/\left(\sigma_y^{(i)2} + \sigma_{\rm int}^2\right)}. \tag{11} $$

One can substitute these values back into Eq. 2 to obtain the profile likelihood for A, B and σ_int, which can then be maximised or sampled to infer the parameters of the fit. We denote this likelihood prof.
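A sketch of this per-point maximisation (the function name is ours): each x̂_t is an inverse-variance-weighted combination of the direct measurement x_o and the value implied by y_o through the line.

```python
import numpy as np

def xt_hat(A, B, sig_int, xo, yo, sig_x, sig_y):
    """xt values maximising the joint likelihood (Eq. 2) at fixed A, B and
    sig_int, used to construct the profile likelihood."""
    s2 = sig_y**2 + sig_int**2
    return (xo / sig_x**2 + A * (yo - B) / s2) / (1.0 / sig_x**2 + A**2 / s2)
```

When a point lies exactly on the line, both pieces of information agree and x̂_t = x_o.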
We study this likelihood analytically in Section A.2. Simplifying again to the case of constant σ_x and σ_y, we find that there does not exist a single closed-form expression for all the parameters at maximum likelihood. Instead, one must solve a cubic equation for ŝ² (see Section B and Algorithm 1 for a discussion of the real analytic roots of this equation) and the desired root depends on the data used. Moreover, in some cases the maximum-likelihood solution would be at ŝ² < 0, which is clearly unphysical. Thus, not only does one need to consider the turning points of the likelihood, but also the case with σ_int = 0 separately. We will find that in many cases this is the maximum-likelihood solution for ŝ², and hence the profile likelihood often severely underestimates the intrinsic scatter. The differences between the maximum-likelihood and true values of the parameters also do not have a closed-form analytic expression; they are however non-zero for all the parameters, showing that the profile likelihood is biased.
In practice the Jeffreys prior, p(x_t) ∝ √|I| (Jeffreys 1946), where I is the Fisher information matrix, produces very similar results to the profile likelihood (see also Hadzhiyska et al. 2023). The Jeffreys prior is an attractive choice in the absence of informative prior information because it is reparametrisation invariant, and consequently eliminates volume effects deriving from marginalising over nuisance parameters such as x_t. The profile likelihood is by construction devoid of volume effects, and gives only slightly smaller posterior widths for the other parameters than Jeffreys marginalisation. Just like the prof method, the Jeffreys prior does not produce an unbiased result, because in the presence of intrinsic scatter the maximum of the likelihood Eq. 2 does not occur at the true parameter values.

Gaussian hyperprior
The above likelihoods have in common that they extract no knowledge about the x_t values from the data set itself, but rather marginalise over them before the data have been considered. (The Fisher matrix appearing in the Jeffreys prior depends on the possible data values, but not on the actual measurements.) This is undesirable: would one think, for example, that the x_t values of a dataset where 0 < x < 5 (and small σ_x) could lie anywhere in (−∞, ∞)? The alternative, which is key to removing the biases explored above, is to learn about the x_t distribution from the data set itself. In the Bayesian hierarchical model given in Fig. 1, this involves inferring not only the parameters of the equation relating x_t and y_t, θ, but also the parameters of the prior on x_t, α.
Fig. 1. Directed acyclic graph representation of the Bayesian hierarchical model for a functional fit f in the presence of uncertainties in both the independent and dependent variable. Green boxes represent probability distributions, red boxes deterministic relations, blue ellipses observable quantities and white circles variables. For mnr, α contains the mean and width of the Gaussian hyperprior; for a linear fit, θ contains the slope and intercept.

The simplest nontrivial implementation of this is to give x_t a Gaussian prior whose mean μ and width w are free parameters to be inferred from the data. This yields

$$ p(x_{\rm t} \,|\, \mu, w) = \prod_i \frac{1}{\sqrt{2\pi w^2}} \exp\left[ -\frac{\left(x_{\rm t}^{(i)} - \mu\right)^2}{2 w^2} \right]. $$

Although one could jointly sample the x_t values alongside θ, in the majority of cases we do not wish to find the x_t values themselves, so we can analytically marginalise over them to obtain

$$ \mathcal{L}_{\rm MNR}(x_{\rm o}, y_{\rm o} \,|\, \theta, \mu, w) = \prod_i \mathcal{N}\left( \begin{pmatrix} x_{\rm o}^{(i)} \\ y_{\rm o}^{(i)} \end{pmatrix};\ \begin{pmatrix} \mu \\ A\mu + B \end{pmatrix},\ \begin{pmatrix} w^2 + \sigma_x^{(i)2} & A w^2 \\ A w^2 & A^2 w^2 + \sigma_y^{(i)2} + \sigma_{\rm int}^2 \end{pmatrix} \right). $$

Again, we consider the case of equal errors for all data points to gain analytic insight into the bias of this method. As derived in Section A.3, the maximum-likelihood values are

$$ \hat{A} = \frac{{\rm Cov}(x_{\rm o}, y_{\rm o})}{{\rm Var}(x_{\rm o}) - \sigma_x^2}, \qquad \hat{B} = \bar{y}_{\rm o} - \hat{A}\,\bar{x}_{\rm o}, \qquad \hat{\mu} = \bar{x}_{\rm o}, \qquad \hat{w}^2 = {\rm Var}(x_{\rm o}) - \sigma_x^2, \qquad \hat{\sigma}_{\rm int}^2 = {\rm Var}(y_{\rm o}) - \hat{A}\,{\rm Cov}(x_{\rm o}, y_{\rm o}) - \sigma_y^2, $$

which are the same as the results for the uniform case, except that we replace Var(x_o) with Var(x_o) − σ_x² in those expressions. As demonstrated in Section A.3, this is the correction required to remove the bias of the unif method: the maximum of the mnr posterior occurs at the true parameter values irrespective of the underlying x_t distribution for large sample sizes. Much of the rest of the paper is devoted to showing that this remains true when the assumptions of constant uncertainties and large sample size are relaxed, and hence that mnr is a robust solution to the regression problem.
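Because each (x_o, y_o) pair is bivariate normal under mnr, the marginal log-likelihood takes only a few lines; this is an illustrative sketch rather than the roxy implementation:

```python
import numpy as np

def mnr_loglike(A, B, sig_int, mu, w, xo, yo, sig_x, sig_y):
    """MNR marginal log-likelihood: with a Gaussian prior N(mu, w^2) on xt,
    each (xo, yo) pair is bivariate normal with analytic mean and covariance."""
    dx, dy = xo - mu, yo - (A * mu + B)
    var_x = w**2 + sig_x**2
    var_y = A**2 * w**2 + sig_y**2 + sig_int**2
    cov_xy = A * w**2
    det = var_x * var_y - cov_xy**2       # determinant of the 2x2 covariance
    quad = (var_y * dx**2 - 2 * cov_xy * dx * dy + var_x * dy**2) / det
    return -0.5 * np.sum(quad + np.log(4 * np.pi**2 * det))
```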
It is possible for one Gaussian not to be sufficiently flexible to capture the x_t distribution. The simplest way to increase flexibility is to replace the Gaussian by a GMM, viz. a sum of N_g Gaussians, each with their own weights, means and variances. We will find in Sec. 3.3 that a single Gaussian is typically sufficiently flexible to eliminate bias even in cases where the true distribution of x_t is highly non-Gaussian, and hence that adding more parameters to the prior may not be justified given the increased complexity. Nevertheless we will show cases where adding more Gaussians does reduce small residual biases, and provide this as an option in roxy.

NUMERICAL EXPERIMENTS
Having established the basic properties of the three methods for integrating out x_t, we now construct a systematic suite of numerical experiments to investigate their performance in full generality. This appears not to have been done before for any method, possibly because with less efficient sampling methods than Hamiltonian Monte Carlo it is computationally infeasible to sample the stochasticity of mock datasets thoroughly while spanning a high-dimensional parameter space. We achieve it by generating and fitting mock data sets with the following relevant properties:

• True (generating) values of slope, intercept and intrinsic scatter, Ã, B̃ and σ̃_int.
• True (generating) distribution of x_t, x̃_t.
• Number of datapoints N .
• Distribution of x and y uncertainties.
However, not all of these are independently relevant to the quality of the regression: there are four degrees of freedom related to the translation and scaling invariance of the problem (where one puts the origin, and the units of x and y). Fixing one of the parameters of the x̃_t distribution and the true intercept, B̃, sets the origin, and the second parameter of the x̃_t distribution and the mean y error set the units. Thus we fix B̃ = 1 and the mean y error. We take the mean σ_x, σ̄_x, to be a free parameter, and set σ_x ← G(σ̄_x, σ̄_x/5), where G is a Gaussian distribution with mean given by the first argument and standard deviation given by the second.
We consider three x̃_t probability distributions: a uniform between 0 and 30, a triangular (or ramp) distribution rising linearly from 0 to 30, and an exponential peaking at 0 with a scale length of λ_x. The results are qualitatively similar in each case, with the exponential distribution, being the most non-Gaussian, providing the toughest test for mnr. We therefore focus on this distribution for the remainder of the section. The mock data are generated by first drawing N uncorrelated x̃_t samples, then calculating the true y values as

$$ \tilde{y}^{(i)} = \tilde{A}\,\tilde{x}_{\rm t}^{(i)} + \tilde{B} + G(0, \tilde{\sigma}_{\rm int}). $$

The observed values are then generated according to

$$ x_{\rm o}^{(i)} = G\left(\tilde{x}_{\rm t}^{(i)}, \sigma_x^{(i)}\right) \quad {\rm and} \quad y_{\rm o}^{(i)} = G\left(\tilde{y}^{(i)}, \sigma_y^{(i)}\right). $$

In all cases we derive posteriors using the No-U-Turn Sampler (NUTS) variant of Hamiltonian Monte Carlo (HMC) as implemented in the numpyro sampler (Phan et al. 2019; Bingham et al. 2019). We use 700 warmup and 5000 sampling steps and verify that the Gelman-Rubin statistic r satisfies r − 1 < 10⁻³. We impose improper uniform priors on A, B and σ_int, such that A and B can take any value and σ_int is forced to be positive. Similarly, in the case of mnr, we place improper uniform priors on μ and w, where μ can take any value and w is positive. These are the defaults in roxy.
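A minimal numpy sketch of this generation procedure; the parameter values are illustrative rather than the fiducial choices of Table 1, and we take the absolute value of the drawn uncertainties to guarantee positivity (an assumption of ours):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative mock-data generation: exponential xt distribution, straight
# line with intrinsic scatter, heteroscedastic Gaussian errors.
N, A, B, sig_int, lam_x = 1000, 2.0, 1.0, 3.0, 5.0
sx_bar, sy_bar = 2.0, 1.0

xt = rng.exponential(lam_x, N)                       # true x values
yt = A * xt + B + rng.normal(0.0, sig_int, N)        # true y, with scatter
sig_x = np.abs(rng.normal(sx_bar, sx_bar / 5, N))    # per-point x errors
sig_y = np.abs(rng.normal(sy_bar, sy_bar / 5, N))    # per-point y errors
xo = rng.normal(xt, sig_x)                           # observed values
yo = rng.normal(yt, sig_y)
```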

1D variations
The remaining parameter space comprises Ã, σ̃_int, N, σ̄_x and λ_x. As a first test, we fix four of these to fiducial values, as shown in Table 1, and vary the fifth within its full range using 20 equal steps. For each value of this free parameter, and for each likelihood, we generate 150 mock data sets differing only in the random seed used to generate the x̃_t, σ_x, σ_y, x_o and y_o values. For each one we calculate the bias in parameter p as (p̄ − p̃)/δp. For p = {A, B} the overbar denotes the mean of the parameter's posterior and δp its standard deviation. For p = σ_int, due to the lower bound of 0, they instead denote the mode and standard deviation of a normal fit that has been truncated at 0. We summarise the results over the 150 mock data sets by the median and 68% confidence interval of this bias, plotting them against the parameter that is varied. These results are shown in Fig. 2. We see immediately that for mnr the bias values are centred at approximately 0 in all cases, with a 68% confidence bound extending to ±1σ bias as expected. The other methods regularly achieve large biases and are therefore highly unreliable. This is particularly severe in the prof inference of σ_int, which is pinned to 0 for σ̃_int ≲ 9 for our fiducial parameter values; this produces such large biases that typically they cannot conveniently be shown on the plots.
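For p = {A, B} this statistic is simply the following (a hypothetical helper of ours; the truncated-normal treatment used for σ_int is not covered here):

```python
import numpy as np

def bias(samples, p_true):
    """Bias of a posterior in units of its width: (mean - truth) / std."""
    return (np.mean(samples) - p_true) / np.std(samples)
```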
To supplement Fig. 2, we show in Fig. 3 the true vs inferred σ_int for the three methods at the fiducial values of the other parameters. We calculate here only the maximum-likelihood values (i.e. without posterior uncertainties), showing the median and 95% spread from 300 random mock datasets at each σ̃_int value for each method, along with the residuals with respect to the truth for the unif and mnr methods in the lower panel. This shows that while the unif and mnr methods recover approximately correct σ_int from small to large values, the profile likelihood returns σ_int = 0 until σ̃_int is several times larger than σ_x and σ_y, beyond which it is slightly biased low. We have already observed this behaviour in Section 2.3.

5D variations
By fixing all but one parameter, this test does not provide a thorough exploration of the parameter space of possible data sets. While we readily see that the prof and unif methods are highly biased, there may be datasets where even mnr is unreliable. We therefore now consider the full 5D grid in Ã, σ̃_int, N, σ̄_x and λ_x, sampled uniformly (logarithmically in N) between the limits [-30, 30], [0, 20], [10, 4000], [0, 20] and [1, 15] respectively, with 5 grid points along each dimension. At each grid point we generate 150 mock datasets with the corresponding parameters and fit them using mnr, calculating in each case the bias in A, B and σ_int as defined above. We then locate the points in the 5D space where the average bias in A, B or σ_int is greatest in either direction, and at these points we calculate the distribution of bias values in the three parameters over the mock data sets.
We find the points in the [Ã, σ̃_int, N, σ̄_x, λ_x] space which produce the largest biases are

• A most biased high: [-15, 894, 10, 20, 15],
• A most biased low: [15, 4000, 0, 15, 15],

[Fig. 3 caption: For each σ̃_int value we generate 300 mock data sets and show the median and 95% range of the distribution of maximum-likelihood values, along with the residuals with respect to the truth in the lower panel.]
The distributions of biases in each parameter at these points are shown in Fig. 4. We see that the average bias of mnr never exceeds ∼1σ, showing it to be fully reliable across the parameter space. These maximum biases are typically obtained for large x-errors, a nearly flat x_t distribution and a very large sample size. We have repeated this calculation for the prof and unif methods (not shown), where we find enormous maximum biases: ∼(10, 10, 750)σ for (A, B, σ_int) in prof and ∼(200, 1250, 500)σ in unif.

Gaussian mixture model
Finally, we consider whether the use of additional Gaussians in the x_t prior can alleviate the small biases seen on occasion for mnr. An important consideration here is the choice of priors assigned to the hyperparameters describing the Gaussians. We consider two possibilities: 1) uniform priors (although requiring that the means of the Gaussians are ordered, to break the degeneracies between the different Gaussians during sampling), paralleling the case of mnr, and 2) a hierarchical set of priors as described in Roeder & Wasserman (1997); Carroll et al. (1999); Kelly (2007) (see also Escobar & West 1995). This introduces hyper-hyperparameters coupling the properties of the Gaussians, favouring similar means and widths, which Kelly suggests to be a desirable property.
In the "hierarchical" model, a Gaussian prior is placed on the mean of each Gaussian in the mixture, with some mean μ_* and variance u_*². Additionally, a scaled inverse χ² prior is adopted on the variance of each Gaussian, with one degree of freedom and scale w_*². This introduces three additional parameters to the inference: μ_*, u_*² and w_*². A uniform prior is placed on μ_* and w_*², and a scaled inverse χ² prior is chosen for u_*², with one degree of freedom and scale parameter w_*². We find the two sets of priors to give very similar results, and use the hierarchical set fiducially as it is more common in the literature.
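These hierarchical draws can be sketched as follows; the function and its signature are our own, and we use the fact that a scaled inverse χ² variate with ν degrees of freedom and scale s² can be drawn as ν s² / X with X ~ χ²_ν:

```python
import numpy as np

def draw_hier_priors(ng, mu_star, u2_star, w2_star, rng):
    """Draw GMM hyperparameters from the hierarchical priors: means are
    Gaussian with mean mu_star and variance u2_star; variances are scaled
    inverse chi^2 with one degree of freedom and scale w2_star."""
    mus = rng.normal(mu_star, np.sqrt(u2_star), ng)
    variances = 1.0 * w2_star / rng.chisquare(1.0, ng)  # nu * s^2 / chi2_nu
    return np.sort(mus), variances  # sort means to break the degeneracy
```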
How many Gaussians should one use? In principle one can take the number of Gaussians N_g as a free parameter and marginalise over it as part of the inference (Richardson et al. 2002; Richardson & Green 2002). Since the number of parameters in the inference is a function of N_g, this requires trans-dimensional inference methods such as reversible jump MCMC (Green 1995). An alternative is to use the Bayesian evidence to determine the value of N_g that optimally trades off the gain in likelihood when using more Gaussians with the increase in the number of free parameters. This is however difficult to compute, so one may wish to approximate it through the Bayesian Information Criterion (BIC). This is defined as (Schwarz 1978)

BIC = k ln N − 2 ln L̂,

where k is the number of free parameters of the model, N is the number of data points and L̂ is the maximised likelihood. The BIC is the limit of the evidence when the posterior is modelled as a multi-dimensional Gaussian around the maximum a posteriori point and the number of data points is much larger than the number of free parameters. We implement this method in our code, providing a simple way of determining the optimum model. Note however that using more Gaussians does not necessarily lead to a reduction in bias (Carroll et al. 1999), and if there are too few datapoints the inference of a large number of Gaussian hyperparameters may not be feasible. The safest choice is therefore to perform a sensitivity analysis, examining the results for a range of different N_g (conservatively 1-10). The constraints are robust if they are stable across multiple N_g values.

TABLE 2
Distribution of the number of Gaussians that minimises the BIC for each of the runs considered in Fig. 6 (out of 150 mock data sets). For these runs, the mode of the distribution is always ≤ 5, suggesting that a maximum N_g of 10 is sufficient to identify the minimum BIC in all cases.
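The BIC-based selection of N_g can be sketched as follows (illustrative only: the parameter counting below assumes the straight line and intrinsic scatter contribute three parameters and each Gaussian in the mixture contributes a mean, width and weight; roxy's internal bookkeeping may differ):

```python
import numpy as np

def bic(max_loglike, k, n):
    # Bayesian Information Criterion (Schwarz 1978)
    return k * np.log(n) - 2.0 * max_loglike

def best_ngauss(max_loglikes, n, k_base=3, k_per_gauss=3):
    # max_loglikes[i] is the maximised log-likelihood with i+1 Gaussians;
    # pick the N_g (1, 2, ...) that minimises the BIC
    ngs = np.arange(1, len(max_loglikes) + 1)
    ks = k_base + k_per_gauss * ngs
    return int(ngs[np.argmin(bic(np.asarray(max_loglikes), ks, n))])
```

Because the penalty grows with k ln N, a larger N_g is only selected when the likelihood gain outweighs the cost of its extra hyperparameters.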
To begin investigating the impact of N_g, we repeat the procedure used in Fig. 2 for varying slope (first row) and mean x-error size (fourth row), but using two Gaussians instead of one. The results are shown in Fig. 5, where we see that the bias in the first case is unaffected while in the latter it is somewhat reduced.
We now consider the cases which were most biased when only using one Gaussian (Fig. 4). Here we rerun the inference for each dataset using a GMM containing between 1 and 10 Gaussians (with 150 mock datasets as before), and plot the bias as a function of N_g in Fig. 6. For each dataset, we compute the BIC for each number of Gaussians, and show as a separate case the bias distribution when choosing the number of Gaussians that minimises this. The distribution of this number for each run is shown in Table 2. We observe that the bias distribution often changes very little as one varies the number of Gaussians, such that there is no significant, systematic difference between choosing just one Gaussian, the number that minimises the BIC, or any other. So, although it may seem preferable to choose more than one Gaussian to find a better approximation to the distribution of x_t, this does not appear to be necessary to improve the inference on A, B or σ_int. Indeed, in some cases the bias is increased by increasing N_g. Also recall that the parameters used in Fig. 6 are those which are most biased when using one Gaussian, such that one might expect the biggest improvement from increasing this number. In Fig. 7 we repeat this analysis using the fiducial parameters (for which one Gaussian already works near-perfectly) and see that the effect of increasing the number of Gaussians is even smaller.

REAL-WORLD TEST CASE: THE MASSES OF GALAXY CLUSTERS

While the experiments of Section 3 show the possible differences between the unif, prof and mnr results, many of the parameter choices were deliberately rather extreme to stress-test the methods. We now apply the methods to astrophysical data to illustrate the importance of the choice of method in a real-world setup.
Galaxy clusters, tracing the densest regions of the universe, are a fundamental tool in cosmology. However, their most basic property, mass, is not directly observable, and must therefore be estimated by means of an observable proxy. This may be the cluster's "richness" (the number of galaxies it contains), the luminosity of X-rays emitted by hot gas in the intra-cluster medium, the velocity dispersion of galaxies in the cluster or the Sunyaev-Zel'dovich (SZ) decrement (Sunyaev & Zeldovich 1972) when observing the cosmic microwave background through the cluster. Here we directly compare the mass estimates using the latter two methods. We fit the relation of Eq. 19, with slope α and an intercept which we convert to 1 − b; 1 − b is interpreted as the ratio of biases between the two measured masses and the true masses of the objects, which would be unity if both methods accurately determined the true mass (or had the same bias).

Fig. 7.-Same as Fig. 6, but using the fiducial parameters given in Table 1. The bias in each parameter is consistent with zero in all cases, and the very small mean biases in the case of one Gaussian are not reduced by using more.
To fit this relation, Aguado-Barahona et al. consider a number of methods: orthogonal distance regression (Boggs & Rogers 1990), the Nukers method (Tremaine et al. 2002), the method we dub unif (Sec. 2.2), the bivariate correlated errors and intrinsic scatter method (Akritas & Bershady 1996), and the Gaussian mixture model method of Kelly (2007) (which they call Complete Maximum Likelihood Estimation, CMLE). They find that all of these methods give biased results when applied to mock data, but that the Nukers method gives the most reliable bias, and is thus easiest to correct. In Section 3 we demonstrated that mnr robustly gives unbiased results for a range of settings and distributions of x_t, and the properties of this dataset lie well within the 5D parameter space we explore. As before, we fit the prof, unif and mnr methods using uniform priors on the slope and intercept of Eq. 19, the latter of which we subsequently convert to 1 − b through exponentiation.
The results are shown in Fig. 8. We see that the x-uncertainties for the relation are sizeable, with average values for log M^dyn_500 and log M^SZ_500 of 0.44 and 0.10, respectively. The large number of data points means that we have a reasonable constraint on the function within the range of the data, as shown by the posterior predictive distribution for mnr. However, the sizes of the uncertainties result in the three methods giving very different results for the parameter values. We plot the two-dimensional posteriors for α, 1 − b and σ_int (and µ and w in the case of mnr) in Fig. 9. σ_int is inferred to be 0 by the prof method, while the unif and mnr values are compatible. The inferred value from the prof method is compatible with that reported by Aguado-Barahona et al., since the "Nukers" method used in that paper is essentially the same as prof given that the inferred intrinsic scatter is consistent with zero. The small inferred α and the large inferred σ_int for the unif method are exactly the trend expected given the bias of this method discussed in Section 2.2 and computed in Section A.1.1. We thus see that in this case it is crucial to use the principled and unbiased mnr method rather than any of the other x_t marginalisation methods or ad hoc algorithms proposed in the literature.
Our best-fit value of the slope is very similar to that quoted by Aguado-Barahona et al. for the CMLE method implemented in the Python port of Kelly's IDL code. They find this slope to be too low by ∼20% in mock data, which, given the posterior uncertainties, amounts to a ∼3σ bias. This is unexpected, as CMLE is approximately equivalent to our GMM method, which we have shown to give very similar results to mnr. To investigate this in the context of our bias results we generate mock data identical to the cluster data except that the y_t values are generated according to the best-fit mnr line, and with five models for the unknown x_t distribution. These include:
• x_t fixed to the observed values (with the observational noise redrawn to produce a new x_o in each mock dataset);
• x_t set to their maximum a posteriori values given the mnr fit to the data;
• x^(i)_t drawn from N(µ, w²), where µ and w are the maximum a posteriori mean and width of the x_t hyperprior from the mnr fit;
• x^(i)_t ← U(−0.8, 0.6) (a relatively narrow range at the centre of the dataset).
For all choices except the second we find the distributions of bias values in each of the slope, intercept and intrinsic scatter over many mock datasets to closely approximate a standard normal, as would be expected if the method is performing accurately. In the second model, however, we find the slope to be biased low and the intercept high by around 1σ on average. The tails of the bias distributions extend to ∼±3σ in this case, so it is unlikely but possible that in a given mock data realisation mnr could be biased by the amount attributed by Aguado-Barahona et al. to CMLE, although across the entire population the bias is no worse than is seen in Fig. 4 and therefore not a cause for serious concern. Alternatively it may be that the bias testing method of Aguado-Barahona et al. is itself inaccurate.
Upcoming work will rectify the biases caused by the choice of regression method in a slew of astrophysical and cosmological fits (von Hausegger et al. 2024, in prep).

EXTENSION TO NONLINEAR FUNCTIONS AND NON-DIAGONAL COVARIANCE

Until now we have supposed that the x and y errors are completely independent (there was no covariance among the observed x values, among the observed y values, or between x_o and y_o), and we focused on linear regression problems. The unif, prof, mnr and GMM likelihoods that we consider are readily generalisable to arbitrary covariance matrices and functional forms of the fit, provided the x-uncertainties are small compared to the curvature scale of the function.
We now assume that the true values y_t = (y^(1)_t, ..., y^(N)_t) are related by some deterministic function to the true values x_t = (x^(1)_t, ..., x^(N)_t):

y_t = f(x_t, θ),

where θ is the parameter vector we wish to infer. Note that throughout this section we do not assume that the value of y^(i)_t only depends on x^(i)_t and θ, and thus the results hold for arbitrary functions which can combine different x values in the calculation of any given y value. This generalises Eq. 2 to

−2 ln L = (x_o − x_t, y_o − f(x_t))^T Σ^{−1} (x_o − x_t, y_o − f(x_t)) + ln |2π Σ|,

where we have dropped the dependence of f on θ for convenience. Here Σ is the covariance matrix of the data, which we define in block form as

Σ = ( Σ_xx  Σ_xy ; Σ_yx  Σ_yy ).

If one wishes to include intrinsic scatter in the relation, one simply needs to add σ²_int to each of the diagonal elements of Σ_yy, and this can be inferred alongside θ.
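A direct evaluation of this joint Gaussian likelihood at fixed x_t can be sketched as follows (illustrative code, not the roxy implementation; the marginalisation over x_t is performed analytically in the text):

```python
import numpy as np

def joint_loglike(xo, yo, xt, f, Sigma):
    # residual of (x_o, y_o) about (x_t, f(x_t)) under the full
    # (2N x 2N) block covariance Sigma = [[Sxx, Sxy], [Syx, Syy]]
    r = np.concatenate([xo - xt, yo - f(xt)])
    sign, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma)
    return -0.5 * (r @ np.linalg.solve(Sigma, r) + logdet)
```

Using `solve` rather than forming Σ^{−1} explicitly keeps the evaluation numerically stable for large or nearly singular covariance matrices.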
We now assume that we can approximate the function f as linear across the range spanned by each x-uncertainty. We therefore write

f(x_t) ≈ f(x_o) + F (x_t − x_o),    (26)

where F_ij ≡ ∂f^(i)/∂x^(j)_t, evaluated at x_t = x_o. Note that it may be possible to reparametrise a function that appears highly nonlinear to make it near-linear; this is the case for example with power-law fits such as Eq. 19. Unlike the straight-line case, whenever a nonlinear function is fitted Eq. 26 introduces some error, of which the user should be mindful. The approximation breaks down completely at points where the function's second derivative times the x-error size exceeds the first derivative.
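This breakdown condition is easy to check numerically. A sketch using central finite differences for a scalar function (a corresponding warning is raised by roxy; see Section 6):

```python
import numpy as np

def linearisation_unreliable(f, x, sigx, eps=1e-5):
    # central finite-difference estimates of f' and f''
    f1 = (f(x + eps) - f(x - eps)) / (2.0 * eps)
    f2 = (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2
    # Eq. 26 breaks down where |f''| * sigma_x exceeds |f'|
    return np.abs(f2) * sigx > np.abs(f1)
```

For example, for f(x) = x² the condition flags points with |x| smaller than the x-error, where the quadratic curvature dominates the local slope.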

Infinite uniform marginalisation
Beginning with the case of infinite uniform marginalisation, we must exponentiate Eq. 24 and substitute in Eq. 26. We see that to integrate this function over x_t, and thus obtain the marginalised uniform likelihood, we have to perform a Gaussian integral.

Fig. 8.-SZ-inferred cluster mass, M^SZ_500, against dynamical mass, M^dyn_500, for Planck SZ clusters with optical counterparts, as given in Aguado-Barahona et al. (2022) (black points with errorbars). We plot the posterior predictive distribution of the mnr-inferred relation (Eq. 19) in greyscale, including the confidence range up to 3σ, and the posterior mean predictions for all three methods are plotted as coloured lines.

Fig. 9.-Corner plot of the slope, normalisation and intrinsic scatter of the relation between SZ-inferred cluster masses and dynamical masses, as given in Eq. 19, alongside the parameters of the Gaussian hyperprior for mnr. In this case the x-errors result in significant differences between the methods.
Performing this integral, one obtains the marginalised likelihood. To obtain an expression for D, we use the Woodbury matrix identity,

(A + U C V)^{−1} = A^{−1} − A^{−1} U (C^{−1} + V A^{−1} U)^{−1} V A^{−1}.

When one utilises the block-wise inversion formula for matrices, one can identify these terms with the elements of the covariance matrix used in Eq. 24, i.e.
and thus one can write the result compactly.

5.2. Profile likelihood

We now move onto the profile likelihood case. After substituting Eq. 26 into Eq. 24, we differentiate with respect to x_t to find the maximum likelihood point, x̂_t. One can then substitute this into Eqs. 24 & 26 to obtain the maximum likelihood value. As in the simpler case considered before, we see that the second terms of Eqs. 30 & 39 are the same, and only the first terms differ.
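The Woodbury identity used in this derivation can be sanity-checked numerically; a quick illustration with arbitrary well-conditioned matrices:

```python
import numpy as np

def woodbury_inverse(A, U, C, V):
    # (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
    Ai = np.linalg.inv(A)
    inner = np.linalg.inv(np.linalg.inv(C) + V @ Ai @ U)
    return Ai - Ai @ U @ inner @ V @ Ai
```

The identity is useful here because A is diagonal (cheap to invert) while the correction U C V is low-rank, so the inner inverse is of much smaller dimension than the full matrix.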

Marginalised Normal Regression
For the multi-dimensional mnr case, we impose a Gaussian prior on the x_t values,

x_t ~ N(µ, W),

where W is the covariance matrix specifying the distribution of the x_t parameters; this may be proportional to the identity if the x_t are iid draws, but we do not assume this in the following. Using this prior, one must perform a further Gaussian integral, the result of which contains a determinant term that follows from normalisation requirements and can be verified by direct substitution. This provides a simple expression for the mnr result with an arbitrary covariance matrix and function being fitted. Each of these likelihoods can then be sampled or maximised to obtain estimates for the parameters of the function, the intrinsic scatter, and µ and w in the case of mnr. The sampling of the parameters of arbitrary functions with arbitrary covariances is implemented in roxy; in the case of diagonal covariance the likelihoods reduce to the forms derived previously.

6. ROXY: A PUBLIC, SIMPLE AND EFFICIENT IMPLEMENTATION OF MNR

We design and release an implementation of mnr, dubbed roxy (Regression and Optimisation with X and Y errors). This Python code uses automatic differentiation enabled by Jax both to sample the likelihood using Hamiltonian Monte Carlo and to compute the derivative of f(x, θ) (cf. Eq. 27) required in the likelihood if f is nonlinear. We employ the NUTS method implemented in numpyro, and assume uniform priors on all parameters in θ, although their ranges can readily be altered by the user. roxy additionally calculates the maximum likelihood point using the Nelder-Mead algorithm (Gao & Han 2012), in case one wants only point estimates of the parameters. As well as mnr and the GMM method, roxy implements the unif and prof likelihoods, although, in light of the results of Secs. 2 and 3, these are not recommended in general.
By using the NUTS sampler, roxy is able to produce MCMC chains very quickly. For the galaxy cluster example in Section 4 (which contains over 250 data points), a single chain run on a laptop performs approximately 3500 iterations per second, such that a chain with 700 warm-up steps and 5000 samples takes approximately 1.6 seconds to sample. Over 3500 effective samples are produced for each parameter, with Gelman-Rubin statistics equal to unity to within less than 10^−2. Given its efficiency and simplicity of use (one just needs to define the function to fit, the parameters to sample and their prior ranges), we recommend roxy not just in the presence of x errors (in which case mnr should be used), but also in their absence. In this case, the unif method with σ_x = 0 is more efficient, because mnr would still sample the (unconstrained) extra hyperparameters describing the Gaussian prior.
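For intuition, in the diagonal straight-line case the mnr likelihood has a simple closed form: marginalising x_t over N(µ, w²) leaves each (x_o, y_o) pair jointly Gaussian. A minimal self-contained sketch (illustrative only, not the roxy implementation):

```python
import numpy as np

def mnr_loglike(params, xobs, yobs, sigx, sigy):
    # Straight line y = A x + B with Gaussian x_t prior N(mu, w^2):
    # each (x_o, y_o) is bivariate normal with
    #   mean (mu, A mu + B) and covariance
    #   [[w^2 + sigx^2, A w^2], [A w^2, A^2 w^2 + sigy^2 + sig_int^2]]
    A, B, sig_int, mu, w = params
    dx = xobs - mu
    dy = yobs - (A * mu + B)
    Sxx = w**2 + sigx**2
    Sxy = A * w**2
    Syy = A**2 * w**2 + sigy**2 + sig_int**2
    det = Sxx * Syy - Sxy**2
    quad = (Syy * dx**2 - 2.0 * Sxy * dx * dy + Sxx * dy**2) / det
    return np.sum(-0.5 * (quad + np.log(det)) - np.log(2.0 * np.pi))
```

This function could be handed to any optimiser or MCMC sampler; roxy itself samples the equivalent likelihood with NUTS as described above.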
roxy contains a number of failsafes designed to mitigate the risk of failure during sampling, raising warnings if the number of effective samples for any of the parameters is less than 100 or if the peak of the posterior is near the edge of the prior; either likely indicates a pathology in the data set or that the prior range is too small. A warning is also raised if the function's second derivative times the x-error size exceeds the first derivative at any point, which indicates that the first-order Taylor expansion (Eq. 26) may be unreliable. As higher-order approximations are computationally intractable, we suggest that the user reparametrise the problem to make it more linear in this case.
A property of the mnr (as well as unif and prof) likelihood is asymmetry between x and y, so that the regression result depends on which variable is considered independent. Ideally the direction of regression would match the direction of causality in the problem. This may be assessed by treating the scatter of the points around the best-fit line as an additive noise model, in which case the independent variable may be identified as the one that has least correlation with the residuals of the fit (Hoyer et al. 2008; Mooij et al. 2009; Peters et al. 2014; Mooij et al. 2016). The roxy function assess_causality fits both the forward and inverse relations to the dataset, produces plots of the data in both directions with both regression models overlaid and the corresponding normalised residuals plotted against the "x" variable, and calculates various correlation coefficients. From these it makes a recommendation as to which variable to treat as independent. We caution however that the correlations may be non-monotonic and/or nonlinear, and therefore recommend picking the regression direction that visually produces the least correlation in the residuals. Typically the correct regression direction also produces the better-looking fit to the data in both directions, although in some cases none of these diagnostics is conclusive. Further information and an example of the procedure may be found in the online documentation.
The code is released at https://github.com/DeaglanBartlett/roxy, with documentation at https://roxy.readthedocs.io. As well as returning posterior samples and allowing likelihood computations (which can be integrated into the user's larger code), roxy is interfaced with arviz (Kumar et al. 2019) to produce trace plots, corner (Foreman-Mackey 2016) and getdist (Lewis 2019) to make two-dimensional posterior plots, and fgivenx (Handley 2018) for posterior predictive plots. Users are welcome to contact the authors for assistance and to propose modifications and additional features.
7. DISCUSSION AND CONCLUSION

This paper analyses the ostensibly simple problem of fitting a function to data with errors on both the dependent and independent variables. A seemingly sensible approach is to use a Gaussian likelihood with a mean given by f(x_o, θ) and a variance σ²_y + f′(x_o, θ)² σ²_x, which corresponds to marginalising over the latent "true" x values, x_t, under the assumption of an infinite uniform prior. We have shown analytically and numerically that this method yields biased results, demonstrating that this choice of prior on x_t is inappropriate. We also show that simply maximising the likelihood with respect to the x_t values (which is approximately equivalent to marginalising over them with a Jeffreys prior) also gives biased results, which is particularly noticeable when attempting to infer the relation's intrinsic scatter.
We find that the only choice of prior that gives consistently unbiased results is a mixture of one or more Gaussians with means, widths and weights inferred simultaneously with the parameters of the function being fitted. We show that typically one Gaussian is sufficient, and dub the corresponding inference method Marginalised Normal Regression (mnr). We show analytically that for infinite sample sizes with constant uncertainties this gives perfectly unbiased results regardless of the true distribution of x_t values, and numerically that this approximately holds also for finite sample sizes and varying uncertainties.
We publicly release a simple and efficient Python implementation of mnr called roxy (Regression and Optimisation with X and Y errors), which is built on Jax (Bradbury et al. 2018) and numpyro (Phan et al. 2019; Bingham et al. 2019) to allow efficient sampling using the No U-Turn Sampler variant of Hamiltonian Monte Carlo. This code can handle both linear and non-linear functions and correlated errors between all the observed x and y values (i.e. arbitrary covariance matrices).
To demonstrate the importance of regression method, we fit the scaling between the dynamically estimated masses of clusters and those inferred from the Sunyaev-Zel'dovich effect. The slope of the linear fit is biased by 4.3σ and 6.4σ for the uniform marginalisation and profile likelihood methods respectively, and the parameter controlling the normalisation by 2.5σ and 2.1σ. The profile likelihood also drastically underestimates the relation's intrinsic scatter. The MCMC chains used for this example took only a few seconds to run with roxy.
We identify four ways in which this work could be extended. First, although our method works for arbitrary functional forms, our bias investigation has been restricted to linear regression; it would be useful to extend this to the fitting of nonlinear functions and to assess the impact of N_g in the GMM model in this case. Second, it would be useful to extend mnr to arbitrary non-Gaussian (e.g. asymmetric) uncertainties, which are sometimes encountered in real-world applications. Third, as discussed in Sec. 6, our formulation of the problem has an intrinsic asymmetry between the treatment of x (which has corresponding latent variables x_t) and y, which is said to be determined by x up to the uncertainties and intrinsic scatter. While relations in science are indeed typically assumed to be causal, with the independent variable determining the dependent, and hence possess this asymmetry, there may be cases where a more symmetric treatment is desirable. An example would be if the intrinsic scatter is assumed to be Gaussian perpendicular to the fitted curve rather than in a fixed (e.g. y) direction. A more general method capable of addressing this should be developed. Finally, additional capabilities that could on occasion be useful are the ability to model outliers as a separate population to inliers, and the ability to handle missing or truncated data.
Given the strong biases incurred when assuming a Jeffreys or infinite uniform prior on x_t, or when using the profile likelihood, such assumptions must be avoided in analyses requiring regression. We have identified and rigorously stress-tested a method for overcoming these biases, Marginalised Normal Regression, and made it easy to apply through our public package roxy.
For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. Some of the results in this paper have been derived using the fgivenx (Handley 2018), getdist (Lewis 2019), Jax (Bradbury et al. 2018), numpyro (Phan et al. 2019; Bingham et al. 2019), numpy (Harris et al. 2020) and scipy (Virtanen et al. 2020) packages.

A.2.2. Biases
Given that there does not exist a single, simple expression for the inferred values of Â, B̂ and ŝ², we cannot compute general closed-form expressions for their biases. Since many of the terms used to compute the various solutions for ŝ² when using Eqs. A37 & A38 do not cancel (see Algorithm 1 and Section B), we find ŝ ≠ s and hence that this method is biased. For a quantification of this, it is more instructive to consider the more general numerical experiments of Section 3 rather than these analytic expressions.

A.3. Gaussian hyperprior

A.3.1. Maximum likelihood estimates

We now assume the x_t are drawn from a Gaussian distribution of mean µ and width w, where these hyperparameters themselves have uniform priors. In this case, the marginalisation over x_t gives the log-posterior log P(A, B, s, µ, w | x_o, y_o). Let us begin by maximising this likelihood with respect to µ. To make further progress, we must then maximise with respect to A, w and s, which yields coupled equations for Â, ŵ and ŝ. One can then substitute Eq. A52 into Eq. A43 to obtain B̂.

A.3.2. Biases
We now compute the biases for these three maximum likelihood values, using the results in Eqs. A8, A9 & A10.

Fig. 2.-Distribution of biases incurred by the mnr (left), unif (centre) and prof (right) methods in the slope (blue), intercept (red) and intrinsic scatter (green) of the straight line as various parameters of the problem are varied with the others held at their fiducial values (see Table 1). In most cases prof gives such strong negative bias in σ_int that the green curves lie far beyond the range of the plot. While the unif and prof methods are often wildly biased, mnr has a sub-1σ average bias in all cases, with the bias distribution typically conforming very closely to the expected standard normal.

Fig. 3.-True intrinsic scatter vs that inferred by the prof, unif and mnr methods with all other parameters at their fiducial values (see Table 1). For each σ_int value we generate 300 mock data sets and show the median and 95% range of the distribution of maximum-likelihood values, along with the residuals with respect to the truth in the lower panel.

Fig. 4.-Distribution of biases incurred by mnr at the most biased points in our 5-parameter grid search. Separate panels show the results when the slope, intercept and normalisation of the fit are separately maximally biased high and low, and in each case we show the bias in all three parameters. The average bias never exceeds ∼1σ, and for any mock dataset's random stochasticity it does not exceed ∼3σ.

Fig. 5.-Distribution of biases incurred during variation of the true slope (top) and average x-error size (bottom; see Fig. 2) when describing the x_t distribution using two Gaussians. The bias is reduced somewhat in the latter case, but not the former.
Fig. 6.-Distribution of biases across 150 mock data sets as a function of the number of Gaussians used in the GMM, for the same parameters as Fig. 4. We also show the distribution of biases for the number of Gaussians that minimises the BIC, which can be different for each dataset. The bars and bands show the 1σ deviation among the mock data sets. Any improvement from increasing the number of Gaussians or choosing the best number according to the BIC is marginal, suggesting that one Gaussian cannot generically be improved upon.
and with respect to B to give

B̂ = ⟨y_o⟩ − Â (ŵ² ⟨x_o⟩ + σ²_x μ̂) / (ŵ² + σ²_x). (A41)

By substituting this value of B̂ into the equation for μ̂, we find that many terms cancel such that

μ̂ = ⟨x_o⟩, (A42)

independent of all other parameters. Substituting this back into our expression for B̂, we have

B̂ = ⟨y_o⟩ − Â ⟨x_o⟩. (A43)

Combining these results to obtain ŵ, we rewrite Eq. A45 as a relation between (ŵ² + σ²_x)² (Var(y_o) − ŝ²), the cross term 2 Â ŵ² (ŵ² + σ²_x) Cov(x_o, y_o) and Â² ŵ² terms involving Var(x_o).

TABLE 1
Properties of the mock data sets used in our numerical experiments. The fiducial value is used in cases where the parameter is fixed (see Sec. 3.1). Parameters that are never varied have their range quoted as "-".