Sparse Bayesian mass-mapping using trans-dimensional MCMC

Uncertainty quantification is a crucial step of cosmological mass-mapping that is often ignored. Suggested methods are typically only approximate or make strong assumptions of Gaussianity of the shear field. Probabilistic sampling methods, such as Markov chain Monte Carlo (MCMC), draw samples from a probability distribution, allowing for full and flexible uncertainty quantification; however, these methods are notoriously slow and struggle in the high-dimensional parameter spaces of imaging problems. In this work we use, for the first time, a trans-dimensional MCMC sampler for mass-mapping, promoting sparsity in a wavelet basis. This sampler gradually grows the parameter space as required by the data, exploiting the extremely sparse nature of mass maps in wavelet space. The wavelet coefficients are arranged in a tree-like structure, which adds finer-scale detail as the parameter space grows. We demonstrate the trans-dimensional sampler on galaxy cluster-scale images where the planar modelling approximation is valid. In high-resolution experiments, this method produces naturally parsimonious solutions, requiring less than 1% of the potential maximum number of wavelet coefficients while still producing a good fit to the observed data. In the presence of noisy data, trans-dimensional MCMC produces a better reconstruction of mass maps than the standard smoothed Kaiser-Squires method, with the addition that uncertainties are fully quantified. This opens up the possibility of new mass maps and inferences about the nature of dark matter using the new high-resolution data from upcoming weak lensing surveys such as Euclid.


INTRODUCTION
Gravitational lensing is the phenomenon whereby light from distant objects is distorted by the density field between the object and the observer, resulting in stunning images of warped, sheared, magnified and multiplied galaxies. Lensing is due to both light and dark forms of matter, making it a promising probe of the nature of dark matter (Heavens 2009). Weak lensing is the regime of small distortions, the effect of which on distant galaxies can be described as a magnification due to a convergence field κ and a perturbation of intrinsic ellipticity due to a shear field γ.
The convergence field κ is the integrated total mass density along the line of sight (Bartelmann & Schneider 2001; Dodelson 2017), and hence is a measure of the matter overdensity field. The density field contains both Gaussian and non-Gaussian structures. The Gaussian structures of the cosmological initial conditions evolve under the non-linear influence of gravity to produce non-Gaussian structures. As such, it is essential that methods to build maps of the convergence field (mass maps) capture both Gaussian and non-Gaussian structures. With non-Gaussian structures, higher-order statistics such as the Minkowski functionals and bispectrum (Munshi & Coles 2017) of the mass maps can be measured in addition to the usual two-point correlation function (Peebles 1980). From these statistics, one can compare with the predictions of the statistical density distributions of ΛCDM or other cosmologies. Further, one can compare the total density distribution in colliding cluster systems with the observed distribution of baryonic matter to infer the presence and collisional properties of dark matter (e.g. Clowe et al. 2006; Harvey et al. 2015). These sorts of comparisons would be much better informed if uncertainties were also quantified with the mass maps.
† Correspondence address: augustin.marignier@anu.edu.uk
Unfortunately, κ cannot be directly observed, and must be constrained by an ill-posed inverse problem relating observed ellipticities, which sample the shear field γ, to the convergence. There have been many methods proposed to solve this inverse problem. The standard method is Kaiser-Squires (KS, Kaiser & Squires 1993), which is a simple Fourier-space linear operator. However, it is well known that this method is not robust to noise and missing data, and it is often smoothed to mitigate the effects of noise at the expense of small-scale non-Gaussian structures. Despite this, the mass maps recovered with KS are generally reliable, and the method has been extended to the sphere (Wallis et al. 2021) for use in wide-field surveys (Van Waerbeke et al. 2013; Chang et al. 2018; Jeffrey et al. 2018, 2021; Price et al. 2020b). Recently, there has been a push for methods with sparsity-promoting regularisation to preserve non-Gaussianities and deal with irregular data sampling (e.g. Lanusse et al. 2016; Price et al. 2021; Starck et al. 2021), where sparsity is typically promoted in a wavelet basis. Sparsity is the property whereby many of the coefficients representing a signal in a given basis are (close to) zero, a property empirically observed in many images (see e.g. Donoho 2006; Candès et al. 2011).
Despite all these methods, uncertainty quantification for mass-mapping remains a difficult task. Common approaches for uncertainty quantification involve computationally expensive … simulation data, which we discuss further in Section 6. We conclude in Section 7.

Weak Lensing and Mass-Mapping
Gravitational lensing describes the deflection of photons from distant sources as they pass by regions of local gravitational potential variations caused by the local over- or under-density of matter acting as a lens. The weak lensing effect can be described in terms of a 3D lensing potential ψ(r) = ψ(θ₁, θ₂, r), which is the integral along the line of sight of the Newtonian gravitational potential Φ(r),

\psi(\boldsymbol{\theta}, r) = \frac{2}{c^2} \int_0^r \mathrm{d}r' \, \frac{f_K(r - r')}{f_K(r) f_K(r')} \, \Phi(f_K(r')\boldsymbol{\theta}, r').   (1)

Here, r is the comoving distance from the observer to the source, (θ₁, θ₂) are angular positions on the sky, c is the speed of light in a vacuum, and f_K(r) is the comoving angular separation in a cosmology with curvature K, which we take to be 0 (i.e. a flat universe). The gravitational potential is related to the fractional matter over-density field δ(r) by Poisson's equation,

\nabla^2 \Phi(r) = \frac{3 \Omega_m H_0^2}{2 a(r)} \delta(r),   (2)

where Ω_m is the current matter density parameter, H₀ is the current Hubble constant, and a(r) is the scale factor. The aim of mass-mapping is to infer the convergence field κ(θ) from galaxy distortion measurements representing the complex shear field γ(θ) = γ₁(θ) + iγ₂(θ). The convergence represents the total density perturbation along the line of sight.
The convergence and the shear are defined as second derivatives of the lensing potential ψ(θ),

\kappa = \frac{1}{2}(\partial_1^2 + \partial_2^2)\psi, \quad \gamma_1 = \frac{1}{2}(\partial_1^2 - \partial_2^2)\psi, \quad \gamma_2 = \partial_1 \partial_2 \psi.   (3)

Evaluating the derivatives in Fourier space and rearranging gives a linear relation between the shear and the convergence,

\hat{\gamma}(\boldsymbol{k}) = \hat{D}(\boldsymbol{k}) \, \hat{\kappa}(\boldsymbol{k}),   (4)

where

\hat{D}(\boldsymbol{k}) = \frac{k_1^2 - k_2^2 + 2ik_1k_2}{k_1^2 + k_2^2}.   (5)

This defines the Fourier-space KS kernel (Kaiser & Squires 1993). Note that it is undefined at the origin k₁ = k₂ = 0, which corresponds to the mass-sheet degeneracy. The mass-sheet degeneracy implies that κ can only be determined up to an additive constant. Physically, this is because a constant surface mass density does not cause any shear (Bartelmann & Schneider 2001).
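For illustration, the KS relation can be implemented directly with FFTs. The NumPy sketch below (the grid size and random test field are arbitrary illustrative choices, not part of our pipeline) verifies that a noiseless, zero-mean field is recovered exactly, with the k₁ = k₂ = 0 mode discarded, as per the mass-sheet degeneracy:

```python
import numpy as np

def ks_kernel(n):
    """Fourier-space Kaiser-Squires kernel D(k); undefined (set to 0) at k = 0."""
    k1, k2 = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    denom = k1**2 + k2**2
    D = np.zeros((n, n), dtype=complex)
    nz = denom > 0
    D[nz] = (k1**2 - k2**2 + 2j * k1 * k2)[nz] / denom[nz]
    return D

def kappa_to_gamma(kappa):
    """Forward model: shear from convergence, via Equation 4."""
    return np.fft.ifft2(ks_kernel(kappa.shape[0]) * np.fft.fft2(kappa))

def gamma_to_kappa(gamma):
    """KS inversion; the k = 0 mode (the mass-sheet constant) is unrecoverable."""
    D = ks_kernel(gamma.shape[0])
    ghat = np.fft.fft2(gamma)
    khat = np.where(D != 0, ghat / np.where(D != 0, D, 1), 0)
    return np.fft.ifft2(khat).real

# Round trip on a zero-mean field recovers kappa (no noise, no mask).
rng = np.random.default_rng(0)
kappa = rng.normal(size=(32, 32))
kappa -= kappa.mean()
kappa_rec = gamma_to_kappa(kappa_to_gamma(kappa))
print(np.allclose(kappa_rec, kappa))
```

Note that |D̂(k)| = 1 away from the origin, so the inversion itself does not amplify noise; the difficulty in practice comes from noise and masking, not from the kernel.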
The Bayesian posterior distribution of the convergence field, where the field producing the observed complex shear measurements γ is described by some discretised field κ, is given by Bayes' theorem,

P(\kappa|\gamma) \propto P(\gamma|\kappa) \, P(\kappa),

where P(γ|κ) is the likelihood function describing data fidelity between the observed γ and predictions generated from κ, and P(κ) is a distribution describing any prior belief about the convergence field, e.g. that it is sparse in a wavelet basis (see subsections 3.2 and 4.2). The pixels of the κ map, or their representation in a particular basis, are the parameters to be inferred.
In standard MCMC, a sampler will start at some model of convergence κᵢ and propose some new model κ′ᵢ drawn from a proposal distribution q(κ′ᵢ|κᵢ). For a symmetric proposal distribution, the probability of proposing κ′ᵢ when starting at κᵢ is the same as the probability of proposing κᵢ when starting at κ′ᵢ. The proposed model is either accepted (κᵢ₊₁ = κ′ᵢ) or rejected (κᵢ₊₁ = κᵢ) according to an acceptance criterion. A common choice of acceptance criterion is known as the Metropolis-Hastings (MH) criterion,

\alpha(\kappa', \kappa) = \min\left[1, \frac{P(\kappa'|\gamma)\, q(\kappa|\kappa')}{P(\kappa|\gamma)\, q(\kappa'|\kappa)}\right].

Repeating this many thousands of times produces a chain of samples which, thanks to the acceptance criterion, is guaranteed to converge to the posterior distribution, giving the solution of the ill-posed inverse problem of inferring the convergence from the shear. A summary convergence map (e.g. the sample mean) can be calculated, as can any measure of uncertainty. This is a great strength of sampling methods such as MCMC, although it comes at great computational cost. We note here that while this method involves predominantly forward modelling, our analysis solves an inverse problem and hence will be referred to throughout as an inversion.
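The accept/reject logic can be sketched on a toy one-dimensional target, with a standard Gaussian standing in for the mass-mapping posterior (the step size, chain length and seed are illustrative choices only):

```python
import math
import random

def log_posterior(x):
    # Toy stand-in for log P(kappa|gamma): a standard Gaussian.
    return -0.5 * x * x

def metropolis_hastings(n_steps, step=1.0, seed=0):
    rng = random.Random(seed)
    x, chain, accepted = 0.0, [], 0
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step)       # symmetric proposal
        log_alpha = log_posterior(x_prop) - log_posterior(x)
        if math.log(rng.random()) < log_alpha:  # MH acceptance criterion
            x, accepted = x_prop, accepted + 1
        chain.append(x)                         # on rejection, repeat current state
    return chain, accepted / n_steps

chain, rate = metropolis_hastings(20000)
print(round(rate, 2))
```

In practice the proposal width `step` is tuned so that the acceptance rate sits in a healthy range; too-small steps accept nearly everything but explore slowly, too-large steps are almost always rejected.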

Wavelet Transform
The wavelet transform is a popular tool in image processing, having been popularised in the late 20th century (Mallat 1989; Daubechies 1992). Since then, the theory of compressed sensing (Donoho 2006; Candès et al. 2011) showed that sparse signals can be accurately recovered from incomplete data, leading to many studies in different fields using sparsity-promoting priors or regularisation in a wavelet space, including weak lensing studies (e.g. Lanusse et al. 2016; Price et al. 2021; Starck et al. 2021).
Wavelets are rapidly decaying, wave-like functions that exist for a finite amount of time or space and form a basis in which typical signals tend to be sparse. There are many different families of wavelets, the details of which are beyond the scope of this work. The discrete wavelet transform can be seen as high- and low-pass convolutions using scaled and translated versions of the "mother" wavelet, resulting in a separation of localised information at different length scales. Typically these high- and low-pass filters extract the high and low half, respectively, of the frequencies in the image. The outputs of the high-pass filters are the detail wavelet coefficients, and the low-pass filters give the approximation coefficients. Repeating the process on the approximation coefficients returns detail wavelet coefficients at a slightly larger scale than the previous pass. After each pass the wavelet coefficients are downsampled by a factor of two, as only half the frequency content from the previous scale remains. The final set of approximation coefficients is known as the coefficients of the scaling function. The scaling function is distinct from the wavelets as it accounts for the lowest frequency content, in particular at zero frequency.
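A single level of this filter bank is straightforward to write out explicitly. The sketch below uses the simplest (Haar) filters rather than the CDF 9/7 family used later in this work; for the input [4, 4, 2, 2] the finest-scale details vanish because neighbouring samples are equal:

```python
import math

def haar_step(signal):
    """One DWT level: low-pass -> approximation, high-pass -> detail,
    each downsampled by a factor of two."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_dwt(signal):
    """Full decomposition: recurse on the approximation coefficients."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        coeffs.append(detail)
    return approx, coeffs  # final approx = scaling-function coefficient

scaling, details = haar_dwt([4.0, 4.0, 2.0, 2.0])
print(scaling)  # proportional to the signal mean (the zero-frequency content)
```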
The remainder of this section gives a more mathematical description of the 2D discrete wavelet transform, largely summarising the seminal work of Mallat (1989) in which further details can be found by the interested reader.
Beginning in one dimension, consider a function f(x) ∈ L²(ℝ). Consider also a continuously differentiable and exponentially decreasing scaling function φ(x), whose Fourier transform has the shape of a low-pass filter. The projection of f onto the set of dilations and translations of φ by a dyadic scale 2ʲ for j ∈ ℤ gives the approximation of f at scale 2ʲ,

A_{2^j} f(n) = \langle f(x), \varphi_{2^j}(x - 2^{-j} n) \rangle,

where the angled brackets denote the inner product on ℝ. This can be seen as the convolution of f with φ evaluated at a spacing of 2⁻ʲ. The detail signal D_{2^j} f of f at scale 2ʲ is the difference between the approximations at scale 2ʲ and the smaller scale 2ʲ⁺¹. It can be shown (Mallat 1989) that the detail can again be found as the convolution of f with some scaled and translated basis functions at a spacing of 2⁻ʲ. For the detail, the basis is a wavelet function ψ(x), whose Fourier transform is a band-pass filter.
The set of coefficients {A_{2^J} f, D_{2^j} f for 0 < j ≤ J} is the representation of f in the orthogonal wavelet basis. In the discrete case, J = log₂(N), where N is the length of f.
In two dimensions, we consider scaling and wavelet functions that are separable in x and y. The 2D scaling function is given by Φ(x, y) = φ(x)φ(y), allowing for the extraction of both horizontal and vertical approximations. This gives three wavelet functions for the horizontal, vertical and corner (diagonal) details,

\Psi^1(x, y) = \varphi(x)\psi(y), \quad \Psi^2(x, y) = \psi(x)\varphi(y), \quad \Psi^3(x, y) = \psi(x)\psi(y).

The approximation and details of f(x, y) are again obtained by projecting onto the sets of dilated and translated scaling and wavelet functions, giving the representation of f in the 2D orthonormal wavelet basis as {A_{2^J} f, D^1_{2^j} f, D^2_{2^j} f, D^3_{2^j} f for 0 < j ≤ J}. We omit here the details of the construction of specific scaling and wavelet functions and of computing the transform in practice, as they are unnecessary for understanding the mass-mapping method introduced in this work, and form an extensive literature in their own right. We refer the interested reader to texts such as Mallat (1989) and Daubechies (1992) for these details.
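The separable 2D transform then amounts to filtering rows and then columns, again sketched here with Haar filters; the orientation labels in the comments follow one common convention and are illustrative:

```python
import math

def haar_step_1d(x):
    s = 1.0 / math.sqrt(2.0)
    lo = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    hi = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return lo, hi

def haar_step_2d(img):
    """One 2D DWT level via separable filtering: rows first, then columns."""
    rows = [haar_step_1d(r) for r in img]
    L = [r[0] for r in rows]          # row low-pass outputs
    H = [r[1] for r in rows]          # row high-pass outputs
    def cols(block):
        t = list(zip(*block))         # transpose to filter along columns
        out_lo, out_hi = zip(*(haar_step_1d(list(c)) for c in t))
        return [list(r) for r in zip(*out_lo)], [list(r) for r in zip(*out_hi)]
    LL, LH = cols(L)                  # approximation, one detail orientation
    HL, HH = cols(H)                  # second orientation, diagonal detail
    return LL, LH, HL, HH

# A constant image has no detail at any orientation.
img = [[3.0] * 4 for _ in range(4)]
LL, LH, HL, HH = haar_step_2d(img)
print(all(v == 0.0 for block in (LH, HL, HH) for row in block for v in row))
```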

TRANS-DIMENSIONAL TREE INVERSION
In trans-dimensional MCMC, the parameterisation and the dimensionality of κ are allowed to change (Green 1995). For example, the number of wavelet coefficients describing κ is variable. This allows the available data to determine the complexity of the solution, rather than having this set a priori. Instinctively, it might seem that such a trans-dimensional inversion will naturally tend to more complex models, as one can obtain arbitrarily better fits to data by adding more and more model parameters. It has been shown, however, that results are generally parsimonious. To generalise the MH acceptance criterion, the determinant of a Jacobian matrix describing the transformation from one parameter space to another is included, i.e.

\alpha(m', m) = \min\left[1, \frac{P(m')}{P(m)}\, \frac{P(\gamma|m')}{P(\gamma|m)}\, \frac{q(m|m')}{q(m'|m)}\, |J|\right].   (10)

Significantly though, as we outline in the following section, with simple choices about the parameterisation and proposal distribution the determinant of the Jacobian is 1, effectively recovering the original acceptance criterion.
The trans-dimensional sampler we use here is also known as a birth/death sampler, where model parameters are gradually added or removed, as well as having their values perturbed as in standard MCMC (Green 1995). In this work, we use a parameterisation first proposed by Hawkins & Sambridge (2015) for geophysical inversions. Described as a tree structure, nodes can be added to and removed from the tree via the birth and death proposals, respectively. The nodes are arranged such that birth proposals add more detail to the model. We give some details of the structure and the prior and proposal distributions here, although for brevity we only describe the choices made for our mass-mapping application. For more details, we refer the reader to Hawkins & Sambridge (2015). The aim of using this tree structure in the mass-mapping context is to gradually add more and more small-scale detail without having to sample an exceedingly large parameter space.

The wavelet tree model
We parameterise our convergence maps using a set of k wavelet coefficients v in a tree structure, denoted by T_k. The root of the tree is the scaling-function coefficient, a single-pixel representation of the image. The tree then branches down to smaller-scale structure. This is shown schematically in Figure 1. The k coefficients of the current tree are indicated by black dots and their connections in the tree are indicated by the red arrows. The maximum depth of the tree is the Jth wavelet scale, chosen based on the desired output resolution. Our model space vector m is then given by

m = \langle v_{i_1}, \ldots, v_{i_k} \rangle,

where ⟨•, . . ., •⟩ denotes a vector and i is a unique index for each of the coefficients in the tree.
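To make the branching concrete, the sketch below enumerates such a tree with an index scheme of our own choosing (not necessarily that of Hawkins & Sambridge 2015): the root has 3 children, one per detail orientation, and every deeper node has 4:

```python
def scale_of(i):
    """Wavelet scale of tree index i (0 is the scaling coefficient)."""
    if i == 0:
        return 0
    j = 1
    while i >= 4 ** j:
        j += 1
    return j

def children(i, max_scale):
    """Child indices: the root has 3 children (one per orientation),
    every deeper node has 4 (2x2 downsampling). Illustrative indexing."""
    j = scale_of(i)
    if j >= max_scale:
        return []
    if i == 0:
        return [1, 2, 3]
    p = i - 4 ** (j - 1)                      # position within its scale
    return list(range(4 ** j + 4 * p, 4 ** j + 4 * p + 4))

def count_nodes(max_scale):
    """Total coefficients reachable in a tree restricted to max_scale."""
    total, frontier = 1, children(0, max_scale)
    while frontier:
        total += len(frontier)
        frontier = [c for i in frontier for c in children(i, max_scale)]
    return total

print(count_nodes(3))  # -> 64
```

The total for J scales is 1 + 3 + 12 + ... = 4^J, which matches the pixel count of a 2^J by 2^J image, as it must for an orthogonal transform.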
The prior
The prior P(m) consists of three parts: (1) the probability of the number of coefficients, P(k); (2) the probability that the k coefficients are arranged in the tree T_k, P(T_k|k); and (3) the prior on the wavelet coefficient values, P(v|T_k, k).
For the probability of k we choose a simple uniform prior,

P(k) = \frac{1}{k_{\max}}, \quad 1 \le k \le k_{\max},

where we take k_max = 2^{2J}, i.e. the theoretical maximum number of wavelet coefficients with J wavelet scales. This prior is effectively uninformative, as we have no initial prior belief as to how many wavelet coefficients should be in the tree. This choice also simplifies the calculation of the proposal probabilities, which we describe in detail in subsection 3.3.

Fig. 1.-The 2D wavelet tree parameterisation for a tree of maximum depth 3. Each coloured square corresponds to a wavelet coefficient, although only those with black dots form part of the current tree (red arrows). This current tree has k = 11 coefficients. The coefficients of each wavelet scale (j = {0, 1, 2, 3}) are separated by solid black lines and grouped by colour (blues, oranges, greens, yellows). The blue coefficient at the root of the tree is the scaling coefficient. Beneath this we have 3 wavelet coefficients at scale j = 1. Beneath each of these three coefficients we have 4 coefficients at scale j = 2, where the relation is represented by the colour shade, e.g. the 4 dark green coefficients in j = 2 are beneath the dark orange coefficient in j = 1. From each of the j = 2 coefficients we then get another 4 coefficients at scale j = 3, where the relation is now represented by dashed black lines, e.g. the bottom-right set of 4 dark yellow j = 3 coefficients are directly below the bottom-right dark green j = 2 coefficient. Notice how the tree (red arrows) need not extend all the way down to the maximum depth.
Similarly, we use a uniform prior for the arrangement of T_k. Our tree is restricted and heterogeneous. Restricted trees have a maximum depth, in our case the number of wavelet scales J. Heterogeneous trees vary the number of child coefficients that each parent coefficient has. For the 2D wavelet tree (Figure 1), the scaling coefficient (j = 0) has 3 children (1 each for the horizontal, vertical and diagonal detail at the next scale), and then each coefficient from j = 1 onwards has 4 children, as the wavelet transform downsamples by a factor of 2 in both the vertical and horizontal directions (subsection 2.3). Our prior is then

P(T_k|k) = \frac{1}{N_{k,J}},

where N_{k,J} is the number of valid heterogeneous tree configurations of k coefficients restricted to depth J. This is calculated via a recurrence relation, the details of which we omit here in the interest of space. We refer the interested reader to Hawkins & Sambridge (2015) for details, including a fast algorithm for computing prior ratios of tree structures. We leave discussion of our value prior P(v|T_k, k) to subsection 4.2, as it is physically motivated rather than imposed by our parameterisation. Suffice to say here that we use a generalised Gaussian prior on our wavelet coefficient values, with larger wavelet scales (low j) having a more Gaussian prior than smaller scales (high j).
These three priors are independent, so our overall prior P(m) is given by the simple product

P(m) = P(k)\, P(T_k|k)\, P(v|T_k, k).

Fig. 2.-Proposal sets for the 2D wavelet tree parameterisation. The wavelet coefficients are arranged as in Figure 1, although here they are coloured by the proposal that may be performed at that coefficient. Value, death and birth proposals can be made at the blue, orange and green coefficients, respectively. Note that the death set is a subset of the value set, and the birth set is disjoint from the other two sets.

Proposals
In this birth/death sampler we have three classes of proposals: (1) value proposals, where the value of a wavelet coefficient is modified; (2) birth proposals, where a new coefficient is added to the tree; and (3) death proposals, where a coefficient is removed from the tree. We divide our wavelet coefficient space into three sets, S_v, S_b, S_d, which contain the coefficients at which a value, birth or death proposal, respectively, can be performed. An example is shown in Figure 2. The value set S_v corresponds to the current tree, and will always include at least the root of the tree, hence |S_v| ≥ 1. The death set S_d is the "low-hanging fruit" of the tree, i.e. coefficients without any active child coefficients. The birth set S_b is the set of currently inactive child coefficients of the current tree. Notice that the death set is a proper subset of the value set, S_d ⊂ S_v, and the birth set is disjoint from the other two sets, S_b ∩ S_v = S_b ∩ S_d = ∅.
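The bookkeeping for these sets can be sketched as follows. The tiny hard-coded tree and the choice to exclude the root from the death set (so that the tree can never empty) are our own illustrative assumptions:

```python
# A tiny tree given as parent -> children; node 0 is the root.
TREE = {0: [1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11], 3: [12, 13, 14, 15]}

def proposal_sets(active):
    """Split coefficients into the value, death and birth sets.
    `active` is the set of nodes currently in the tree (always contains 0)."""
    S_v = set(active)                                    # any active node
    S_d = {i for i in active                             # active leaves,
           if i != 0 and not any(c in active for c in TREE.get(i, []))}
    S_b = {c for i in active for c in TREE.get(i, [])    # inactive children
           if c not in active}                           # of active nodes
    return S_v, S_d, S_b

S_v, S_d, S_b = proposal_sets({0, 1, 2, 4, 5})
print(sorted(S_d), sorted(S_b))  # -> [2, 4, 5] [3, 6, 7, 8, 9, 10, 11]
```

The set relations stated above hold by construction: every death candidate is an active (value) node, and birth candidates are never active.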
The value proposal distribution is given by

q_v(m'|m) = \frac{q(\Delta v_i)}{|S_v|},   (15)

where the numerator is the distribution from which a perturbation Δv_i to wavelet coefficient v_i is drawn. We choose this to be a zero-mean normal distribution with variance tuned to achieve an acceptance rate of roughly 20-40%. The denominator accounts for the probability of choosing to perturb the ith coefficient. Birth proposals are performed by selecting a coefficient j from the birth set S_b and initialising it with a value drawn from the value prior. In the case where the birth set is empty, i.e.
the current tree spans all coefficients, the proposal probability is zero. Hence, we have

q_b(m'|m) = \begin{cases} P(v_j)/|S_b| & \text{if } S_b \neq \emptyset, \\ 0 & \text{otherwise.} \end{cases}   (16)

Death proposals simply involve choosing a coefficient j to remove from the death set S_d, so

q_d(m'|m) = \frac{1}{|S_d|}.   (17)

The acceptance criterion (Equation 10) requires the reverse proposal distributions q(m|m') for each of these proposals. The value proposal distribution is symmetric, so the reverse value proposal is equal to the forward proposal. As for the birth and death proposals, these are the reverse proposals of each other, i.e. the reverse of proposing the birth of coefficient j from S_b is proposing its death from S'_d, where the prime here denotes the set after the proposal. So the reverse proposals are

q_b(m|m') = \frac{1}{|S'_d|}, \quad q_d(m|m') = \frac{P(v_j)}{|S'_b|}.

Acceptance criteria
Recall that the acceptance criterion (Equation 10) depends on the prior, likelihood and proposal ratios, and a Jacobian. The likelihood ratio remains the same regardless of the proposal, and we denote it L(m', m) = P(γ|m')/P(γ|m). The prior and proposal ratios depend on the type of proposal performed.
For value proposals the tree structure does not change, so the prior ratio is simply whether or not the proposed coefficient value is more likely than the current value,

\frac{P(m')}{P(m)} = \frac{P(v'_i)}{P(v_i)}.

As for the proposal ratio, the reverse distribution is equal to the forward distribution, so the ratio is 1. Since there is no change in the dimension of the tree, the Jacobian will always be the identity, with determinant equal to 1. Hence the acceptance criterion for value proposals is

\alpha_v(m', m) = \min\left[1, \frac{P(v'_i)}{P(v_i)}\, L(m', m)\right].   (22)

For birth proposals, the tree structure changes. Using our uniform prior for the number of coefficients in the tree, this part of the prior ratio is unity, P(k + 1)/P(k) = 1. The prior values also all cancel out except at the proposed new coefficient j. So the overall birth prior ratio is, dropping the conditional dependencies for clarity,

\frac{P(m')}{P(m)} = \frac{N_{k,J}}{N_{k+1,J}}\, P(v_j).

Using the reverse distributions above, the proposal ratio is

\frac{q_b(m|m')}{q_b(m'|m)} = \frac{|S_b|}{|S'_d|\, P(v_j)}.

For the Jacobian, it is a simple case of Green (1995)'s dimension matching without any transformation of random variables. We can denote our parameter space m as a set of k tuples, each containing a unique index i_n denoting a position in the tree and the value v_n of the wavelet coefficient at that position. Dimension matching then requires us to sample random variables u_i and u_v for the position and value of the new coefficient, respectively, and then to set the proposal m' to be some function of m, u_i and u_v. Using the proposals defined previously, that function is simply the identity. For all the pre-existing parameters (n = 1 . . . k)
there is no change, and for the new coefficient we set (i_{k+1}, v_{k+1}) = (u_i, u_v); hence there is no transformation of the random variables and the Jacobian is the identity matrix. So the birth proposal acceptance criterion is

\alpha_b(m', m) = \min\left[1, \frac{N_{k,J}}{N_{k+1,J}}\, \frac{|S_b|}{|S'_d|}\, L(m', m)\right].   (25)

Similar reasoning, and considering that the Jacobian of the death proposal is the inverse of the birth proposal Jacobian, leads to the death proposal acceptance criterion,

\alpha_d(m', m) = \min\left[1, \frac{N_{k,J}}{N_{k-1,J}}\, \frac{|S_d|}{|S'_b|}\, L(m', m)\right].   (26)

The full trans-dimensional MCMC algorithm
Starting from a randomly initialised convergence model m_0 with k = 1, trans-dimensional MCMC iterates for N iterations, for sufficiently large N. At each iteration i, the type of proposal is chosen with probability P(birth), P(death) and P(value). These probabilities can be set arbitrarily, subject to the constraints that P(birth) = P(death) and P(birth) + P(death) + P(value) = 1 (Hawkins & Sambridge 2015). Having selected the proposal type, a new model m' is proposed according to the corresponding proposal distribution q(m'|m_i) (Equations 15, 16 and 17). The proposal is then accepted with probability α(m', m_i) according to the corresponding acceptance criterion (Equations 22, 25 and 26), setting the next sample in the chain m_{i+1} = m'. Note that at each iteration, only a single model parameter (wavelet coefficient v_i) is affected. In practice, the first N_burn iterations are discarded and only every N_thin-th sample thereafter is kept. This avoids saving too many initial samples that are far from the posterior peak and reduces the correlation between samples, while still allowing for adequate sampling of the posterior. Algorithm 1 summarises the trans-dimensional MCMC algorithm.
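The control flow of this loop, including burn-in and thinning, has roughly the following shape. This is a structural sketch only: the model is a bare list of values and every proposal is accepted, where the real sampler applies the proposal-specific acceptance criteria above:

```python
import random

def trans_d_mcmc(n_iter, n_burn, n_thin, p_birth=0.25, p_death=0.25, seed=0):
    """Skeleton of the birth/death sampler; placeholder model and acceptance."""
    rng = random.Random(seed)
    model, kept = [0.0], []                        # start with k = 1
    for i in range(n_iter):
        u = rng.random()
        if u < p_birth:                            # birth: add a coefficient
            proposal = model + [rng.gauss(0.0, 1.0)]
        elif u < p_birth + p_death and len(model) > 1:
            j = rng.randrange(1, len(model))       # death: remove one (not root)
            proposal = model[:j] + model[j + 1:]
        else:                                      # value: perturb one coefficient
            j = rng.randrange(len(model))
            proposal = list(model)
            proposal[j] += rng.gauss(0.0, 0.1)
        # Placeholder acceptance: accept everything. The real sampler uses the
        # Metropolis-Hastings-Green criteria (Equations 22, 25 and 26).
        model = proposal
        if i >= n_burn and (i - n_burn) % n_thin == 0:
            kept.append(list(model))               # burn-in discarded, thinned
    return kept

samples = trans_d_mcmc(n_iter=1000, n_burn=200, n_thin=10)
print(len(samples))  # -> 80
```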

MASS-MAPPING WITH TRANS-DIMENSIONAL TREES
In this section we give details of how we use the trans-dimensional tree structure described previously for mass-mapping. In particular, we describe the forward operator and the choice of prior used in the following sections.

Forward Operator
Our trans-dimensional MCMC samples the wavelet coefficients m of a convergence map κ. The convergence map is then κ = W⁻¹m, where W⁻¹ is the inverse wavelet transform. To use the KS kernel of Equation 4, denoted by D̂, we need forward and inverse fast Fourier transforms, F and F⁻¹, respectively. As such, obtaining the shear from sampled convergence wavelet coefficients is given by

\gamma = F^{-1} \hat{D} F W^{-1} m \equiv \Phi m.   (27)

The input data for our inversions are gridded galaxy shear measurements. Gridding is necessary to allow the galaxy ellipticity due to shear to emerge out of the shape noise due to intrinsic ellipticity. We make the common assumption that the intrinsic ellipticities of galaxies are Gaussian randomly oriented with zero mean, and thus we take the observed shear in a small region of sky (pixel) to be the mean of the ellipticities of the galaxies in that region. However, as a result of the gridding we are left with fewer data points, and likely insufficient data to produce a high-resolution convergence map. Further, depending on the coarseness of the gridding it is possible that some pixels contain no galaxy measurements, as is common in weak lensing surveys, which may lead to signal leakage between Fourier modes. We assume that the shear measurements in each pixel follow a Gaussian distribution of known variance σᵢ². As such, we set the likelihood of observing shear measurements γ for given convergence map wavelet coefficients m to be

P(\gamma|m) \propto \exp\left[-\frac{1}{2}(\gamma - \Phi m)^\dagger \Sigma^{-1} (\gamma - \Phi m)\right],

where Φ is the operator defined in Equation 27 and Σ is the data covariance matrix. We typically assume no correlation between neighbouring pixels, and as such Σ is a diagonal matrix with the σᵢ² along the diagonal. We also assume that the real and imaginary parts of the noise have equal variance σᵢ². We note that these assumptions can break down in the presence of systematics such as intrinsic alignment.
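The resulting data-fidelity term is a diagonal-covariance Gaussian log-likelihood over the unmasked pixels. A minimal sketch, taking the forward prediction Φm as given and treating the real and imaginary parts as independent with equal variance:

```python
import math

def gaussian_loglike(gamma_obs, gamma_pred, sigma2, mask):
    """Log-likelihood for complex shear with diagonal covariance.
    sigma2[i] is the per-pixel noise variance of each of the real and
    imaginary parts; masked (empty) pixels carry no information."""
    logL = 0.0
    for obs, pred, s2, ok in zip(gamma_obs, gamma_pred, sigma2, mask):
        if not ok:
            continue
        r = obs - pred
        # Two independent Gaussian dimensions per complex pixel.
        logL -= (r.real ** 2 + r.imag ** 2) / (2.0 * s2) + math.log(2.0 * math.pi * s2)
    return logL

obs = [1.0 + 0.5j, 0.2 - 0.1j]
pred = [1.0 + 0.5j, 0.2 - 0.1j]
print(gaussian_loglike(obs, pred, [0.04, 0.04], [True, False]))
```

Only the likelihood ratio enters the acceptance criteria, so any additive normalisation constant cancels and can be dropped in practice.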

Generalised Gaussian Prior on Wavelet Coefficients
Strictly, sparsity is measured by the ℓ₀-norm, which counts the non-zero elements of a vector. However, this norm is non-convex, and it has been shown that the convex ℓ₁-norm, or equivalently the Laplace distribution, produces comparable results, and so is now widely used for sparsity promotion (Donoho 2006; Candès et al. 2011). However, thanks to the multiresolution nature of the wavelets we can impose different priors at different length scales. In particular, we expect the wavelet coefficients of larger-scale structure to be more Gaussian than those at smaller scales (Starck et al. 2021). We therefore use a generalised Gaussian distribution (GGD) prior,

P(v_i) = \frac{\nu}{2\sigma\Gamma(1/\nu)} \exp\left[-\left(\frac{|v_i - \mu|}{\sigma}\right)^\nu\right],

where Γ(•) is the gamma function, μ is the mean, σ is the scale parameter (c.f. standard deviation) and ν is the shape parameter. For ν = 2 this gives the Gaussian distribution, ν = 1 gives the Laplacian, and as ν → ∞ this tends to a uniform distribution in the range [μ − σ, μ + σ]. McEwen et al. (2017) suggested learning the scale and shape parameters for each wavelet scale based on simulation images. Figure 3 shows the fitted GGDs of wavelet coefficients at each scale for the Bolshoi 7 & 8 galaxy cluster simulation convergence images (Klypin et al. 2011). The fitted scale and shape parameters are also shown, and clearly the larger-scale wavelet coefficients are more Gaussian than the smaller-scale coefficients. We use the distributions in Figure 3 as a loose guide for tuning the shape and scale parameters at each wavelet scale for the application to simulation data in the following section. It would be unreasonable to use the exact distributions, as they would not be known in the real-data case. We note that this prior implies that the individual wavelet coefficients are considered independent.
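The GGD density is simple to evaluate; the sketch below checks the ν = 2 and ν = 1 limits noted above (at the mode, the ν = 2 case equals 1/(σ√π), i.e. a Gaussian with variance σ²/2, and the ν = 1 case equals the Laplacian peak 1/(2σ)):

```python
import math

def ggd_pdf(x, mu, sigma, nu):
    """Generalised Gaussian density with mean mu, scale sigma, shape nu."""
    norm = nu / (2.0 * sigma * math.gamma(1.0 / nu))
    return norm * math.exp(-((abs(x - mu) / sigma) ** nu))

print(ggd_pdf(0.0, 0.0, 1.0, 2.0))  # equals 1/sqrt(pi)
```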
In the wavelet tree parameterisation, the root of the tree is a single-pixel approximation of the κ map, i.e. the image mean. However, the mass-sheet degeneracy means κ can only be determined up to an unknown constant λ, with κ_λ = λ + (1 − λ)κ. We can then set the prior of the tree root to be extremely tight around 0, as it cannot be determined from the data.

APPLICATION TO SIMULATIONS
In this section we present the results of tests of our trans-dimensional tree method on datasets generated from simulated cluster-scale convergence maps.

Data
We begin with a ground truth convergence map κ, extracted from the Bolshoi N-body simulation (Klypin et al. 2011). We use the same simulation catalogues as Lanusse et al. (2016) and Price et al. (2021), extracted using the CosmoSim1 website. This assumed a redshift of z = 0.3, a 10 × 10 arcmin² field of view, and the convergence maps have been normalised with respect to lensing sources at infinity. The ground truth image is shown in Figure 4, sampled on a 256 × 256 pixel grid.
From the ground truth convergence maps we generate noisy synthetic shear data γ from

\gamma = F^{-1} \hat{D} F \kappa + n,

where the noise n is drawn from a zero-mean Gaussian distribution N(0, σ²), and σ² is the data covariance determined by the number of observed galaxies in a pixel and an assumed intrinsic galaxy ellipticity dispersion of σ_e = 0.37 (Price et al. 2021),

\sigma^2 = \frac{\sigma_e^2}{N_{\mathrm{pix}}},

where N_pix is the expected number of galaxies in a pixel for a given number density of galaxies per arcmin², n_gal. For example, Euclid is expected to be able to see about n_gal ∼ 30 galaxies per arcmin² (e.g. Laureijs et al. 2011). To make the data somewhat more realistic, we apply a random mask to the noisy shear data, whereby we mask out 1% of pixels to simulate regions with no galaxy observations.

Method
We invert the synthetic shear data for a convergence map using the trans-dimensional tree method detailed in Section 3. The wavelets we use are the Cohen-Daubechies-Feauveau 9/7 wavelet family (Cohen et al. 1992), as these produced the best results for Hawkins & Sambridge (2015). We run a single Markov chain for at least 10⁶ steps and check that the likelihood and the number of model parameters k have converged. If needed, we restart the chain from where it ended and extend it for as long as necessary. The generalised Gaussian prior parameters (shape and scale) are tuned for each wavelet scale, except the lowest scale, which represents the mass-sheet degeneracy, so as to achieve an appropriate acceptance rate, typically around 20-50%. Thus these are not free parameters during inference. Again, the values obtained by fitting a GGD to the ground truth Bolshoi simulations (Figure 3) are used as a loose guide, and we ensure that our priors are wider than these fitted distributions. It is conceivable that these parameters could be determined in a hierarchical manner (e.g. Bodin et al. 2012; Alsing et al. 2015); however, we leave this for future work. As in Price et al. (2021), we evaluate the similarity between our convergence solutions (mean or best-fitting) and the ground truth using two quantities: the signal-to-noise ratio (SNR) in dB to assess the overall difference, and the Pearson correlation coefficient (r) to assess the structural similarities.
These are given by

\mathrm{SNR} = 20 \log_{10} \frac{\|\kappa\|_2}{\|\kappa - \kappa^*\|_2}, \quad r = \frac{\sum_i (\kappa_i - \langle\kappa\rangle)(\kappa^*_i - \langle\kappa^*\rangle)}{\sqrt{\sum_i (\kappa_i - \langle\kappa\rangle)^2} \sqrt{\sum_i (\kappa^*_i - \langle\kappa^*\rangle)^2}}.

Here κ* denotes our solution convergence map and ⟨•⟩ denotes an average over the N pixels.
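These two metrics can be sketched directly from their usual definitions (normalisation conventions may differ in detail from those of Price et al. 2021):

```python
import math

def snr_db(truth, est):
    """Signal-to-noise ratio of a reconstruction, in decibels."""
    num = math.sqrt(sum(t * t for t in truth))
    den = math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, est)))
    return 20.0 * math.log10(num / den)

def pearson_r(a, b):
    """Pearson correlation coefficient between two (flattened) maps."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

truth = [1.0, 2.0, 3.0]
print(round(snr_db(truth, [1.0, 2.0, 4.0]), 2))  # 10*log10(14) ~ 11.46 dB
```

The SNR is sensitive to the overall amplitude of the residual, while r is invariant to affine rescalings of the solution, so the two are complementary.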

Recovery of a clean high-resolution mass-map
Figure 4 shows the ground truth map, the mean and highest-posterior (MAP estimate) trans-dimensional MCMC point solutions, and the range of the highest posterior density region at the 99% credible interval level. This is a very clean data example. Both the mean and highest-posterior point estimates recover the three main high-convergence regions well, although some of the fainter sources are missing, possibly lost in the noise or smoothed out when averaging over many samples in the case of the mean solution. The highest-posterior point estimate nicely recovers the peak at the core of the lowermost cluster, which seems to have been lost in the mean solution. In terms of the overall SNR and correlation, however, the mean solution is better than the MAP.
At a given pixel, the quantified uncertainty is represented by the two-tailed 99% credible interval range, symmetric about the peak of the histogram of values that the pixel takes in our MCMC chain. Assuming a unimodal histogram, this is the highest posterior density region (HPDRange in Figure 4). The Bayesian credible interval is an interval in which the pixel value falls with probability 99%. The credible interval range is thus calculated as the range between the 0.5th and 99.5th percentiles of the individual pixel values of the convergence map. Credible intervals are widely used Bayesian measures of uncertainty (e.g. Gelman et al. 2013; Pereyra 2017; Price et al. 2020a; Marignier et al. 2023). The lateral extent of the three main clusters in the uncertainty map is slightly larger than in the point solutions. As well as highlighting the three main peaks, where the detailed structures are not quite resolved, the uncertainty map picks out some of the smaller, fainter structures that were missed in the point solutions. For example, a faint source in the top left quadrant is identified, as are faint structures in the vicinity of the topmost and lowermost main peaks. We emphasise here that while we have chosen to show a credible interval as our measure of uncertainty for its Bayesian interpretation, with adequate sampling of the posterior any measure of uncertainty can be obtained.
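Given the stored chain of convergence samples, the per-pixel credible interval range can be sketched as follows (the array layout is an assumption; the paper's own implementation may differ):

```python
import numpy as np

def credible_interval_range(samples, level=0.99):
    """Per-pixel width of the two-tailed credible interval.

    samples: array of shape (n_samples, ny, nx), one convergence map per
    retained MCMC sample. For level=0.99 this returns the range between the
    0.5th and 99.5th percentiles at each pixel, as an (ny, nx) map."""
    tail = 100.0 * (1.0 - level) / 2.0
    lo = np.percentile(samples, tail, axis=0)
    hi = np.percentile(samples, 100.0 - tail, axis=0)
    return hi - lo
```
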
Figure 5 shows a histogram of the number of active tree nodes (non-zero wavelet coefficients) over all the samples in the trans-dimensional MCMC chain, as well as the evolution of the likelihood and the number of wavelet coefficients as the chain progresses. The number of coefficients generally increases, as designed by the parameterisation. As the number of coefficients increases, the likelihood improves, as expected. However, the number of parameters does not increase indefinitely, converging at around 250 coefficients.
In this case, the ground truth is sampled on a 256 × 256 pixel grid, meaning the maximum potential size of the parameter space is 65 536 wavelet coefficients. Settling at around 250 active coefficients, less than 1% of this maximum, is therefore a massive reduction of the parameter space.
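The sparsity figure quoted in the abstract can be checked directly from these numbers:

```python
# Fraction of the potential wavelet parameter space actually used.
k_active = 250        # typical converged number of active coefficients
k_max = 256 * 256     # all possible coefficients on a 256x256 grid: 65 536
fraction = k_active / k_max
print(f"{fraction:.2%} of the maximum parameter space")  # well under 1%
```
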

Low-resolution and high-noise inversions
In the example shown previously, the noise level was kept relatively low, using 5000 galaxies per arcmin², which corresponds to just under 8 galaxies per pixel. The Euclid telescope is expected to observe around 30 galaxies per arcmin² of sky (Laureijs et al. 2011). For this 10 × 10 arcmin patch of sky, that would correspond to an average of less than 0.05 galaxies per pixel. During our initial experiments, we found this noise level to be too high for our method to work. Adding or removing parameters made so little difference to the likelihood (data fit) that almost any proposal would be accepted, and since births and deaths are proposed in equal proportions, the tree would never grow.
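The per-pixel galaxy counts quoted above follow from the field size and grid resolution:

```python
# Average galaxies per pixel for a 10x10 arcmin field on a 256x256 grid.
field_arcmin2 = 10 * 10
n_pix = 256 * 256

per_pixel_clean = 5000 * field_arcmin2 / n_pix   # ~7.63, "just under 8"
per_pixel_euclid = 30 * field_arcmin2 / n_pix    # ~0.046, "less than 0.05"
```
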
The solution that we found to beat down the noise is to decrease the resolution. At 32 × 32 pixels, a Euclid-like noise level corresponds to about 3 galaxies per pixel for this patch of sky. Figure 6 shows the results (mean and uncertainty) of our trans-dimensional inversions at this lower resolution for varying noise levels. Also shown are the KS solution and the KS solution with optimum smoothing. The smoothing uses a simple 2D Gaussian kernel, the optimum size of which was determined by a simple linear search over the kernel size, identifying which kernel produced the best SNR. This would obviously not be possible in practice with real data, as it requires knowledge of the true convergence field. As such, this optimum solution represents the top end of what could be possible with KS. At all noise levels, trans-dimensional MCMC performs better than the optimally smoothed KS solution in terms of SNR and correlation with the ground truth. Visually, even at the highest noise levels, the lowermost peak is more tightly constrained by MCMC. The additional uncertainty information obtained from MCMC provides further constraints on the locations of peaks, although, somewhat surprisingly, the uncertainty is more laterally spread in the low noise cases than in the high noise cases. The magnitude of the uncertainty does decrease as the noise decreases, as expected.
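A minimal sketch of this optimum-smoothing search, using SciPy's `gaussian_filter` and a hypothetical grid of kernel sizes (the paper's search grid is not specified), which only works on simulations where the ground truth is known:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def optimally_smoothed_ks(ks_map, truth, sigmas=np.linspace(0.1, 5.0, 50)):
    """Linear search over Gaussian kernel widths: return the smoothed KS map
    with the highest SNR against the ground truth (simulation-only oracle)."""
    def snr_db(t, e):
        return 10.0 * np.log10(np.sum(t**2) / np.sum((t - e)**2))
    return max((gaussian_filter(ks_map, s) for s in sigmas),
               key=lambda m: snr_db(truth, m))
```
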
Figure 7 shows the results for the highest noise level (30 galaxies per arcmin²), zooming in on the two main clusters. We have taken an 8 × 8 pixel box around the high mass cluster at the bottom of the image (red box and middle row in Figure 7), and a 12 × 12 pixel box around the low mass cluster towards the top (green box and bottom row in Figure 7). The signal-to-noise ratio and correlation coefficients presented have been recalculated within the spatial extent of the respective clusters, and the optimum smoothing for each cluster was found as described previously. For the high mass cluster, trans-dimensional MCMC again outperforms the optimally smoothed KS method in terms of both overall difference and structural similarity with the ground truth. For the low mass cluster, the recovery is slightly worse than the optimally smoothed KS in terms of the magnitude of convergence, resulting in a slightly lower SNR, although the correlation remains higher. This shows that our trans-dimensional MCMC is competitive with, if not better than, KS with ad-hoc smoothing for a range of cluster masses.
In Figure 8 we show the power spectra of the solutions shown in Figure 6, compared to the ground truth. At all noise levels, and particularly at the highest, Euclid-like noise level, the spectrum of the mean trans-dimensional MCMC solution is closer to that of the ground truth than the KS solutions. Importantly, this is also the case at smaller length scales, where KS is significantly affected by noise and where smoothing has the undesirable effect of removing small-scale information. The ground truth spectrum is also shown to be within the uncertainties obtained by MCMC, whereas the KS solutions are inconsistent with the MCMC solution at small length scales. Uncertainties are seen to be larger at large scales than at small scales. This is likely a result of the mass-sheet degeneracy, despite there being a tight prior around 0 for the scaling coefficient to alleviate this issue. These higher uncertainties are thus probably due to the larger-scale wavelets, which still contribute to the background mean and for which we have wider prior distributions (see Figure 3).
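The radially averaged power spectrum used in such comparisons can be sketched as follows (a simple flat-sky FFT with integer radial binning; the estimator used in the paper may differ in normalisation):

```python
import numpy as np

def radial_power_spectrum(kappa):
    """Radially averaged power spectrum of a square convergence map."""
    n = kappa.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(kappa))) ** 2
    ky, kx = np.indices((n, n)) - n // 2
    k = np.hypot(kx, ky).astype(int)   # integer radial wavenumber bins
    spectrum = (np.bincount(k.ravel(), weights=power.ravel())
                / np.bincount(k.ravel()))
    return spectrum[: n // 2]          # up to the Nyquist frequency
```

For a plane wave of wavenumber 4 across the map, the returned spectrum peaks in bin 4, which is a quick sanity check of the binning.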

DISCUSSION
The trans-dimensional MCMC algorithm is able to significantly decrease the size of the parameter space, searching only a few hundred parameters rather than tens of thousands in the high resolution case (Figure 4). The uncertainties also provide further constraints on the lateral extent of the peaks of the convergence field. All our results compare favourably to the standard KS method, in terms of the overall fidelity of the reconstruction and the ability to fully quantify uncertainties (Figure 6), the recovery of low mass structures (Figure 7), and the behaviour at small length scales (Figure 8). Importantly, our results compare favourably to the optimally smoothed KS solution, which is unobtainable in practice. Furthermore, this sampling problem is far from computationally expensive, thanks to the reduction in parameter space: all the inversions were conducted on a 2020 MacBook Pro with an Apple M1 processor, with the high resolution example taking only 5 hours.
With a new method that more accurately recovers mass maps from noisy shear data while fully quantifying uncertainties, an exciting prospect opens up for new results relating to dark matter distributions in galaxy clusters. This method could be applied to high-resolution shear catalogues from the Hubble Frontier Fields (Koekemoer et al. 2014) or new data from Euclid (Laureijs et al. 2011) to produce new mass maps and comparisons with, for example, data from the Chandra X-ray Observatory (Weisskopf et al. 2000) to investigate the collisional nature of dark matter, as in Harvey et al. (2015). Harvey et al. (2015) used a parametric model to reconstruct the density distribution of their colliding galaxy systems (Navarro et al. 1997; Jullo et al. 2007). Our non-parametric method would thus relax some of the physical assumptions made in their density reconstructions, providing additional constraints and validation of their findings.
The main limitations of this trans-dimensional sampler are two-fold. Firstly, convergence to the peak of the posterior is difficult to determine. In some cases, the sampler may seem to have converged, then a new tree node is introduced that causes a large jump in posterior probability, bringing the sampler into a new region of parameter space to be explored. This is a general problem with MCMC methods. While there is some theoretical understanding of the convergence properties of various probabilistic methods (e.g. Roberts & Smith 1994; Mengersen & Tweedie 1996; Pereyra 2016; Durmus et al. 2017), how this translates into a number of iterations is not straightforward, and so in practice it is often a case of simply continuing for as long as possible.
Fig. 3.- Fitted generalised Gaussian distributions of wavelet coefficients for the Bolshoi 7 & 8 N-body galaxy cluster simulations, normalised to have unit height. The scaling function coefficients and the largest wavelet scale coefficients have been omitted, as there are not enough coefficients for a fit to be truly meaningful. The fitted scale and shape parameters are indicated.
Fig. 4.- (top left) Ground truth convergence field from which synthetic shear measurements were obtained. (top right) Mean solution from trans-dimensional MCMC. (bottom right) Highest posterior sample (estimate of the MAP solution) from trans-dimensional MCMC. (bottom left) Size of the highest posterior density region as a measure of uncertainty. All images show a 10 × 10 arcmin² field of view.
Fig. 5.- History of the MCMC chain that produced the results in Figure 4. (top) Histogram of the number of wavelet coefficients. (bottom) Evolution over the Markov chain of the likelihood (blue) and the number of wavelet coefficients (orange). Note that the samples in the bottom plot have been thinned (only every 100th sample is shown), whereas the top plot includes every sample. This explains why the top plot contains more samples (area under the histogram) than the bottom plot (length of the x axis).
Input: Number of iterations N
Data: Data vector d ∈ C^M
Result: Chain of length N of model parameter vectors m_i ∈ R^k
for i = 1, ..., N do
    Propose new sample m′ from the corresponding proposal distribution q(m′ | m_i)
    Calculate the corresponding acceptance probability α(m′, m_i)
    Accept or reject m′; i ← i + 1
end
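A minimal fixed-dimension Metropolis-Hastings loop corresponding to this structure might look like the sketch below (with a symmetric random-walk proposal; the trans-dimensional birth/death moves of the tree sampler are not shown, and all names are ours):

```python
import numpy as np

def metropolis_hastings(log_posterior, m0, n_iter, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings sketch. Returns the chain of samples.

    log_posterior: function returning log p(m | d) up to an additive constant.
    m0: initial model vector. step: proposal standard deviation."""
    rng = rng or np.random.default_rng(0)
    chain = [np.asarray(m0, dtype=float)]
    log_p = log_posterior(chain[0])
    for _ in range(n_iter):
        # Symmetric Gaussian proposal, so the Hastings ratio reduces to
        # the posterior ratio p(m') / p(m_i).
        proposal = chain[-1] + step * rng.standard_normal(chain[-1].shape)
        log_p_new = log_posterior(proposal)
        if np.log(rng.uniform()) < log_p_new - log_p:
            chain.append(proposal)       # accept
            log_p = log_p_new
        else:
            chain.append(chain[-1])      # reject: repeat the current state
    return np.array(chain)
```
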