Capse.jl : EFFICIENT AND AUTO-DIFFERENTIABLE CMB POWER SPECTRA EMULATION

,


INTRODUCTION
The experimental cosmological landscape is extremely rich and promising.Upcoming millimeter-wave telescopes, including the Simons Observatory (SO, Ade et al. 2019) and CMB-S4 (Abazajian et al. 2019), are building on the legacy of the Planck satellite (Aghanim et al. 2020a), the South Pole Telescope (SPT, Sobrin et al. 2022) and the Atacama Cosmology Telescope (ACT, Aiola et al. 2020) to deliver precise, sample variance dominated measurements of cosmic microwave background (CMB) anisotropies down to arcminute scales in intensity and polarization.Meanwhile, photometric and spectroscopic galaxy surveys such as the Dark Energy Survey (Abbott et al. 2018), KiDS (de Jong et al. de Jong et al.), BOSS (Dawson et al. 2013), DESI (Levi et al. 2013), Euclid (Laureijs et al. 2011), the Vera Rubin Observatory (Abate et al. 2012), the Nancy Grace Roman Space Telescope (Spergel et al. 2015), and SPHEREx (Crill et al. 2020) are poised to accurately measure the positions and shapes of billions of galaxies, reconstructing the large-scale structure (LSS) of the Universe out to high redshifts.Fueled by a dramatic increase in the data throughput, these experiments promise to map out a large portion of the Hubble volume to put under stress the ΛCDM model, delivering groundbreaking insights into the fundamental properties of the cosmos.
However, analyzing the large amounts of data generated by these experiments poses significant challenges.In most cases, the analysis is carried out in terms of summary statistics, such as the two-point function in either real or harmonic space.At the heart of Bayesian inference pipelines, Einstein-Boltzmann codes such as CAMB (Lewis et al. 2000) or CLASS (Blas et al. 2011) solve for the evolution of cosmological perturbations and calculate CMB/LSS observables, which are in turn compared to the data in the likelihood evaluation.Depending on the dimensionality and the topology of the parameter space, this operation is typically repeated 10 4 -10 6 times before convergence thresholds are reached; considering that a call to a Boltzmann solver might require from a second to a minute in case of high-precision settings, these numbers show the high computational cost of Monte Carlo Markov Chains (MCMC) analysis.
Emulation has emerged as a highly promising technique for accelerating cosmological inference.It consists in developing surrogate models that approximate the outputs of computationally expensive models at significantly lower computational costs.This approach offers a twofold advantage: it expedites calculations and enables the utilization of alternative optimization and sampling methods, such as those based on gradients.In recent years, emulators have proven successful in various domains with computationally intensive forward models, including cosmology.Early contributions in this area can be traced back to the work of Jimenez et al. (2004) and Fendt & Wandelt (2006), which presented the initial applications of emulators based on polynomial interpolation within the field of cosmology.The first emulator based on neural networks (NN), called CosmoNet, was introduced in Auld et al. (2007).Recent advancements include the work of CosmicNet (Albers et al. 2019;Günther et al. 2022), with emulators targeting the replacement of the most computationally demanding components within Boltzmann solvers.Additionally, Nygaard et al. (2023), Günther (2023), and To et al. (2023) adopted an iterative approach for emulator training to reduce the overall computational resources required.Currently, the most precise CMB emulator (i.e.matching the CAMB highprecision settings) that accurately coveres a wide parameter space is CosmoPower (Spurio Mancini et al. 2022;Bolliet et al. 2023), which directly emulates the CMB angular power spectra.
In this paper, we present a novel emulator of the CMB temperature, polarization, and lensing potential anisotropies angular power spectra, Capse.jl 1 , which differs from previously developed emulators in two fundamental ways.On the one hand, we propose and implement a more efficient preprocessing, i.e. the operations performed on the training data before being fed to the NN for the training, and test a different technique for data compression.On the other hand, we employ a specialized, performant NN framework in Julia, which results in our emulator returning predictions in around 50 µs.Moreover, Capse.jl can be trained in an hour's time on a standard 8-cores CPU to cover a wide range of cosmological parameters, and down to scales that will be mapped by upcoming experiments.These characteristics make Capse.jl the CMB power spectra emulator with the shortest evaluation time available as well as extremely cheap to train.In addition, we also implement fully differentiable CMB likelihoods.Coupled to our emulators, these methods enable the use of gradient-based inference methods.The combination of these state-ofthe-art inference methods with Capse.jlallows us to analyze Planck data in less than one second and ACT data in ∼ 10 seconds.
This paper is structured as follows.In Sec. 2 we de-1 https://github.com/CosmologicalEmulators/Capse.jlscribe the architecture, the preprocessing, and the training procedure of our emulators.In Sec. 3 we describe the way we construct our likelihoods.In Sec. 4 we describe the sampling algorithms employed in this work.In Sec. 5 we show the main results of our work, namely the precision and the speed of our emulators coupled with differentiable likelihoods.In Sec. 6 we compare our emulators with similar works present in the literature.Finally, we conclude drawing the conclusions of this study in Sec. 7.

CAPSE EMULATOR
The core idea behind surrogate models is quite simple: replacing a computationally expensive function with a sufficiently accurate approximation that is cheaper.Although creating an emulator is conceptually straightforward, there are several design choices involved in the process which can have a drastic impact on the final performance.These choices include the type and architecture of the surrogate model, the preprocessing procedure, the application of dimensional reduction techniques, and potentially incorporating domain knowledge.
To achieve optimal performance and train a fast and accurate emulator while conserving computational resources, all of these choices become significant.In this section, we provide a detailed description of our Capse.jlemulator, focusing on the aspects that differentiate it from other emulators present in literature.We conclude by discussing the accuracy of the trained emulators using a validation data set that covers the entire range of emulation.

Architecture, preprocessing and training
The NN employed for our emulators is a multi-layer perceptron (MLP) with a number of input features that corresponds to the number of cosmological parameters considered, 5 hidden layers with 64 neurons each and a hyperbolic tangent activation function.Our emulators are written in the Julia language (Bezanson et al. 2015) and employ the NN implemented in SimpleChains.jl.While frameworks like PyTorch, TensorFlow, and JAX offer general-purpose capabilities to cover a wide range of use cases, from simple MLPs to complex Large Language Models and from models running on a single CPU to those using grids of TPUs, SimpleChains.jlhas a way more limited focus: it excels at running small NNs on CPUs.By narrowing down the scope, the library prioritizes performance over generality, enabling several choices that improve computational efficiency.This results in a significant boost compared to more traditional frameworks, as demonstrated in previous comparisons with SimpleChains.jl.2 A key aspect of our methodology is the preprocessing of training data, which significantly impacts emulation performance.In this work, we perform min-max normalization on each input and output feature.This process scales all features to fall within a [0, 1] range, a technique known to enhance emulation performance (Nygaard et al. 2023).Furthermore, we leverage domain knowledge, exploiting the dominant linear dependence of the C ℓ 's on the scalar amplitude A s .To account for this, we divide each output feature for the corresponding value of A s , while still treating A s as an independent parameter within the NN.This approach analytically handles the linear scaling, allowing the NN to focus on learning the more nonlinear feature. 3We found that this enhances the emulation accuracy given the same amount of training resources.This improvement can be easily explained by the fact that, given the ln 10 10 A s emulation range, these preprocessing steps reduce the range of the output features by almost a factor of 3, hence improving the emulation performance.We checked it explicitly: the A s rescaling enhances the emulator accuracy by a factor of ≈ 3, almost the same as the C ℓ 's dynamical range reduction.
The nonlinear corrections were computed using HMCode (Mead et al. 2020).The emulation space is also chosen to exactly match that of Bolliet et al. (2023), which allows us to make a fair comparison with their results; the details can be found in Table 1.We trained a NN for each angular spectrum considered in this work (namely T T, T E, EE, ϕϕ), using a standard laptop with an 8-core CPU using the ADAM optimizer (Kingma & Ba 2017) with a mean square error loss function for 50,000 epochs.
In this paper, we explore two different emulation approaches.In the first approach, the output of the NN is a set of CMB power spectra C ℓ 's evaluated on the multipole grid from ℓ min = 2 to ℓ max = 5000 with a step of ∆ℓ = 1.This emulator delivers predictions for a single power spectrum in a meantime of 40 µs.This improvement is truly remarkable when compared to the computation time of predictions using a Boltzmann solver with high-accuracy settings, which typically takes around O(60 s).Consequently, our emulators achieve speeds that are approximately 1,000,000 times faster.
The second approach we consider involves a dimensionality reduction technique, the Chebyshev polynomial decomposition.The key idea behind this approach is to project the ℓ-dependence of the C ℓ 's onto the Chebyshev polynomial basis: where a n represents the n-th coefficient that carries the dependence on the cosmological parameters θ, T n denotes the n-th Chebyshev polynomial, and N max is the highest degree of the polynomials employed in the decomposition. 4The emulator is then trained to reproduce the cosmological dependence of the a n (θ) coefficients.We choose to explore this approach for the nice mathematical properties of Chebyshev interpolation; we refer the interested reader to Trefethen (2012) and Wang (2023) for a detailed discussion.The advantages of this polynomial approach are twofold.First, it significantly speeds up the training process, since the number of output features of the NN is reduced.We have found that using a polynomial order of N max = 48 provides accurate emulation while simplifying the complexity of the neural network.This reduction in complexity not only lowers the training cost but also improves the emulator speed, enabling predictions to be generated in approximately 15 µs.This represents a 2.5fold increase in speed compared to the previous scenario.
The second, perhaps less obvious, advantage stems from this simple observation.By performing a linear decomposition of the C ℓ 's and noting that the basis functions do not depend on the cosmological parameters, we can leverage this linear decomposition in the likelihood computation.If the operations needed to perform the likelihood computation can be represented as a linear operator L acting on the theory vector, we can utilize the linear decomposition as follows: where we have defined: Tn ≡ LT n . (3) If the linear operator L does not depend on the cosmological parameters, as it typically is the case, we can compute the action of L on the basis function once and store the results.As we will demonstrate later, this approach leads to significant speed improvements in the analysis.

Accuracy tests
We carry out two types of tests to assess the performance of the emulators.The first check involves comparing the emulator's predictions with the corresponding ground truth values.This comparison allows us to measure the discrepancy between the emulator and the Boltzmann solver output.In this section, we will thoroughly discuss this validation approach.The second check involves running a Markov Chain Monte Carlo (MCMC) analysis using specific datasets.The MCMC chains produced by the emulator are then compared with the chains mials since they are defined on the interval [−1, 1].The rescaling is automatically handled by the Chebyshev projection library employed in our analysis, FastChebInterp.jlobtained from the Boltzmann code.This comparison serves as an additional validation step to assess the emulator's accuracy and reliability.The details and results of this MCMC analysis will be presented in Sec. 5.
We use CAMB to generate a validation dataset consisting of 20,000 combinations of the cosmological parameters drawn from the same emulation range of the training dataset (see Tab. 1).We emphasize that these specific combinations are not present in the training dataset.Our objective is to assess the accuracy of the emulators by comparing their predictions to the ground truth values.To quantify the emulation error, we calculate the difference between the emulated C XY where XY ∈ [T T, T E, EE, ϕϕ] represent the CMB temperature, polarization, and lensing potential anisotropies power spectra.
To perform this test, we need both the ground truth spectra for computing the emulation error and the expected spectra variance σ ℓ .The variance can be computed, under the assumption that both [X, Y ] ∈ [T, E, ϕ] behave as Gaussian random fields, as (e.g., Knox 1995;Kamionkowski et al. 1997): (5) where f sky and N XX ℓ are respectively the observed sky fraction and the noise power spectrum of field X for the experimental configuration under consideration.We consider two representative upcoming ground-based CMB surveys, Simons Observatory 5 (Ade et al. 2019) and CMB-S4 6 (Abazajian et al. 2019).We refer the reader to Bolliet et al. (2023) and references therein for details regarding the surveys sensitivities.In both cases, we use projected noise curves obtained after foreground cleaning and including multi-frequency information from the Planck satellite.For both experiments we assume an observed sky fraction of f sky = 0.4; the results can be seen in Figs. 1 and B1.
For temperature and polarization we use the following LAT comp sep noise/v3.1.0curves: SO LAT Nell T atmv1 goal fsky0p4 ILC CMB.txt and SO LAT Nell P baseline fsky0p4 ILC CMB E.txt.
For lensing we choose the nlkk v3 1 0 deproj0 SENS2 fsky0p4 it lT30-3000 lP30-5000.dat in the LAT lensing noise/lensing v3 1 1/ folder.The degraded EE 2-point function performance in the ℓ < 20 region arises because of two different factors.First and foremost, we only consider cosmic variance over this multipole range since S4 will not be able to image the largest angular scales at ℓ ≲ 40 and therefore they do not provide noise curves on this range.The second effect comes from the scaling of the 2-point function with respect to τ , which is proportional to τ 2 for ℓ < 30.In fact, the dynamical range of the C EE ℓ in this specific multipole range, across all the cosmologies considered, is 30 times bigger than the one outside.We expect that, when including the τ rescaling in a future work, we will be able to improve the low-ℓ accuracy as well.If we exclude the C ℓ 's with ℓ < 13, the results are even better, with an emulation error that is below 0.08 σ for 99% of the points in the validation dataset.

COSMOLOGICAL INFERENCE FRAMEWORK
After providing a detailed description of the Capse.jlemulator, our focus shifts to leveraging its CMB power spectra predictions for cosmological inference.Specifically, we are interested in drawing samples from the posterior distribution of the model parameters θ given the data d, P (θ|d), after specifying a likelihood function L(d|θ) and the priors Π(θ).To this end, we use the Julia Bayesian statistical inference library Turing.jl(Ge et al. 2018).
Turing.jl is a Probabilistic Programming Language (PPL).Unlike traditional programming languages, which are designed for deterministic computations, PPLs are designed to explicitly incorporate probabilistic constructs.This allows the user to explicitly declare the priors for the input parameters and write their relationship to the observations within a statistical mode.Moreover, the user can also declare the distribution of the observations.Then, Turing.jluses this information to automatically draw an expression for the likelihood by conditioning the model on the observations.
Below is a Julia code snippet demonstrating how a likelihood is implemented using Turing.jl:CMB-S4 (ΛCDM) Figure 1.: Comparison of the relative differences between Capse.jl and high-accuracy predictions from CAMB for the four 2-point statistics emulated in this study.The differences are normalized to the forecasted CMB-S4 statistical uncertainties (f sky = 0.4) and measured in units of cosmic variance for ℓ < 40.The shaded regions represent the 68%, 95%, and 99% percentiles of the respective distributions.This is slightly different from the actual likelihood being used, which performs a reparametrization as explained in Appendix A.
In this work, we use the distributions listed in Tab. 2 as priors for the input parameters of the Capse.jlemulator.The only exception is τ , for which we employ a Gaussian prior, stemming from the analysis of the low-ℓ EE angular spectrum.Moreover, since both Planck , ACT, and SPT-3G measurements are Gaussiandistributed (see Sec. 5.1), we use Turing.jl to draw a Gaussian likelihood of the shape: where t(θ) is a theory vector computed as a function of the model parameters θ and C is the covariance matrix of the data vector d.
While the bottleneck is typically the computation of the theory vector t(θ), even for the other emulators described in the literature, this is not true anymore in our case, especially for the data and theory vectors of the Planck and ACT analyses.The binning of the C ℓ 's and the χ 2 computation are more expensive by a factor of 7 than our emulators computations.
In order to overcome this issue, we follow two strategies.The first is the one depicted in Eq. 2. As anticipated in Sect.2.1, all the operations required to compute the Planck and ACT likelihoods can be represented as a linear operator L on t(θ).In this case, we can compute its action on the basis function once Tn = LT n and store it.During the MCMC, we multiply Tn with a n (θ) and then compute the χ 2 .This procedure, while speeding up the analysis, does not introduce any theoretical error: the computation of the binned C ℓ 's using the two approaches matches up to machine-precision.
A further optimization that we can introduce is using MOPED, a linear data compression algorithm (Heavens et al. 2000(Heavens et al. , 2017)).Given a n-dimensional data vector whose theoretical prediction depends on m parameters, MOPED generates n m-dimensional data vector, b i .Moreover, if the noise of the data does not depend on the parameters of the model, as in the case of Planck and ACT measurements, the original and compressed data vectors can be shown to have identical Fisher information matrices with respect the parameters of the model (Heavens et al. 2000).The MOPED compression vectors b i can be computed using: where t ,i is the derivative of the theory vector with respect to the i-th parameter and F ii is the i-th element on the diagonal of the Fisher matrix F : As stated in Campagne et al. ( 2023), it is pretty straightforward to evaluate both the Fisher matrix and the t derivatives using automatic differentiation (AD).Therefore, we use MOPED to perform a compression of the likelihood shown in Eq. 6.This allows us to transform a relatively expensive n-dimensional Gaussian likelihood into the product of m one dimensional unit-variance Gaussian likelihoods: where M i is the MOPED compression matrix for the i-th parameter of the model.However, unlike our first strategy, using the MOPED compression changes the actual likelihood being targeted, opening the possibility for changes the results of our analysis.

SAMPLERS
We now outline the methods employed to explore the likelihoods introduced in Sec. 3 and derive parameter constraints for the Planck , ACT, and SPT-3G datasets.The algorithms employed, all based on leveraging the likelihood gradient, are: • NUTS, a version of the Hamiltonian Monte Carlo sampler (Hoffman & Gelman 2011).This is considered the state-of-the-art algorithm to be employed in MCMC analysis.
• Pathfinder, a variational inference methods (Zhang et al. 2022).Although less robust than the previous methods, as it lacks of asymptotic convergence properties, Pathfinder excels at quickly delivering draws from the typical set, that can be representative of the posteriors in the case of Gaussian or mildly non-Gaussian posterior, or used to initialize standard MCMC.
In this section we describe all these methods, while addressing the interested reader to the relevant works in the literature.

Hamiltonian MonteCarlo
Hamiltonian Monte Carlo (HMC) (Betancourt 2018) is a MCMC algorithm that explores a parameter space by simulating the dynamics of a Hamiltonian system.The fundamental idea behind HMC is to introduce auxiliary momentum variables that are independent of the target function.The joint distribution of the position (i.e. the original parameters) and momentum variables is then defined by a Hamiltonian function, which governs the dynamics of the system.The Hamiltonian is typically chosen to be the sum of the potential energy, defined as the negative logarithm of probability density of the likelihood, and the canonical kinetic energy of the momentum variables.
In each HMC iteration, a proposal is generated by simulating the dynamics of the system for a fixed number of steps using a numerical integration scheme.The acceptance probability of the proposal is then computed using the Metropolis-Hastings (MH) algorithm based on the Hamiltonian energy of the sample.Then, it can be shown that the target distribution can be returned by marginalizing the momenta variables: One of the challenges of HMC is choosing the step size and number of integration steps, which can have a significant impact on the performance of the algorithm.To address this challenge, the No-U-Turn Sampler (NUTS) (Hoffman & Gelman 2011) algorithm was proposed as an extension of HMC.NUTS addresses this issue by introducing a recursive algorithm that determines the optimal number of steps and step size for each iteration.The algorithm generates a "tree" of proposals by evolving the Hamiltonian dynamics forwards and backwards in time.The number of steps is then chosen such that the trajectory does not turn on itself.This prevents the sampler from returning to previously explored regions of the parameter space.The algorithm then evaluates the Hamiltonian at each new proposal and computes an acceptance probability using MH.The step size is chosen such that a given target acceptance probability is met.The key feature of MCHMC is that it only makes use of a single Hamiltonian energy level to explore the whole parameter space; i.e. the Hamiltonian energy of the system is conserved.Mathematically this means that the target distribution is obtained as: As shown in Robnik et al. (2022), there is a multitude of Hamiltonians that satisfy Eq. 11.In this work, we use a Julia implementation of the variable mass Hamiltonian as presented in Robnik et al. (2022).MCHMC poses several advantages over traditional HMC.First, it does not require the Metropolis adjustment to return unbiased posteriors as in traditional HMC.Second, the bias on the estimated distributions depends on the Hamiltonian error per dimension while in traditional HMC it depends on the absolute Hamiltonian error.Third, the dynamics of the variable mass Hamiltonian result in the particle moving the slowest in the regions where the likelihood density is the highest.This is the opposite behaviour to traditional HMC where the particle moves the fastest where the likelihood density is the highest.In combination, these three features allow MCHMC to explore high dimensional parameter spaces more efficiently than traditional HMC specially for very large numbers of dimensions, as shown in Bayer et al. (2023).

Pathfinder
Sometimes exploring the parameter space can become too expensive even when the gradient of the likelihood is known.For this reason, the scientific community has developed a series of posterior approximation methods.
The most popular of such methods is variational inference (VI) (Ranganath et al. 2013;Kucukelbir et al. 2016).The goal of VI is to find a tractable approximate distribution to the posterior.In order to do so, VI uses a family of distributions (known as the variational family) in terms of the original parameters of the problem and a series of variational parameters.VI then finds the variational parameters that minimize the Kullback-Leibler (KL) divergence with respect to the posterior.This results in faithful approximations to the posterior even if the actual posterior falls out of the variational family.
Pathfinder is a VI algorithm that locates approximations to the target density along a minimization path (Zhang et al. 2022).Pathfinder uses L-BFGS, a quasi-Newton method that relies on gradients to estimate the Hessian matrix.While moving along the minimization path, samples are drawn from a multivariate normal distribution characterized by the estimated Hessian matrix.Pathfinder outperforms other VI methods in providing quick estimates of the posterior, as shown in Zhang et al. (2022) for a series of benchmark problems.
Moreover, Pathfinder can be run in parallel, greatly improving its estimates of the posterior by combining the approximations of each independent run using Pareto-Smoothed Importance Sampling (PSIS) at a low computational cost (Vehtari et al. 2022).Furthermore, PSIS provides a quantitative metric which can be used to assess the quality of the posterior found by Pathfinder.
Last but not least, Pathfinder can be used to accelerate MCMC methods, using its draws to start chains, whose initial points are usually drawn from the prior and hence are far from the typical set.Since Pathfinder draws are (very likely to be) on the typical set, this reduces the length of the so called burn-in phase, where chains approach the high mass probability region.7

RESULTS
In this section we discuss the main results of this work.We start by testing the accuracy of our emulators performing the MCMC analysis of three different datasets, the Planck (Aghanim et al. 2020b), ACT DR4 (Aiola et al. 2020), and 2018 SPT-3G (Balkenhol et al. 2023) datasets.The analyses are performed using both CAMB and Capse.jl and the obtained posteriors are compared in order to assess the emulator accuracy.In this subsection, we show the constraints obtained using data from the Planck satellite.
Specifically, we consider the compressed temperature and polarization Planck 2018 Plik lite likelihood, which is constructed from the high-ℓ T T + T E + EE CMB bandpowers and covariances that have been premarginalized over astrophysical contributions in the spectrum estimation step (Aghanim et al. 2020a).The only nuisance parameter left in this compressed likelihood is the overall Planck calibration factor A Planck which simply rescales the foreground-marginalized bandpowers as Ĉℓ → Ĉℓ /A 2 Planck .This likelihood is available in Cobaya as planck 2018 highl plik.TTTEEE lite.
In Fig. 2, we compare the contours obtained with CAMB and Capse.jl.The former has been used within Cobaya and a nuisance-marginalized Planck likelihood, with a termination criterion of | R − 1| < 0.01, requiring 6 hours to converge with 4 parallel chains, each of them with 4 threads, for a total of around 100 CPU-hours.The latter employed PlanckLite.jl,8 a pure Julia implementation of the Planck lite likelihood, and used 6 NUTS chains, with 500 burn-in and 1,000 accepted steps, on a laptop in around 4 minutes of wall-clock time using less than 0.5 CPU-hours, but still reaching an R − 1 smaller than 0.002 for all parameters.We emphasize that the NUTS analyses was performed with only a handful of parameters to tune: the number of parallel chains, the burn-in and sampling steps per chains, and the target acceptance ratio.We did not provide any input such as a covariance matrix or a proposal step-size for each parameter.This shows the robustness of the NUTS algorithm, which is able to perform a Bayesian analysis with little to no input from the user.This is very different from the MH sampler implemented in Cobaya, which greatly benefits in its performance from being finely tuned by the user.
Regarding the accuracy of the analysis, the two contours are very similar, with differences on the marginalized 1D posteriors smaller than 0.1 σ, as is shown in the lower half of Fig. 2.

ACT DR4
Here we show the results from the chains based on the ACT Data Release 4 (DR4) likelihood (Aiola et al. 2020).9This likelihood is constructed from the temperature and polarization T T /T E/EE power spectra measured with ACT data taken between 2013 and 2016 covering more than 17,000 deg 2 of the sky with spatially varying depth.Specifically, the ACT likelihood relies on cleaned CMB bandpowers in the multipole range 300 ≲ ℓ ≲ 4000 that have been marginalized over foreground emissions including the Sunyaev-Zel'dovich effects, diffuse Galactic emission, and the cosmic infrared background.The covariance of the CMB power spectra already includes contributions from foreground, calibration, and beam uncertainties.The only nuisance parameter that is allowed to freely vary is the overall polarization efficiency y p , which linearly and quadratically rescales the C T E ℓ and C EE ℓ spectra, respectively.Also in this case we have coded a pure Julia version of this likelihood, ACTPolLite.jl.10For the CAMB+Cobaya chains we used the same hardware settings from before.However, since we used the high-precision settings for CAMB, the computational resources employed are higher than before, being around 480 CPU-hours required to reach convergence.Regarding our Capse.jl+Turing.jl,we run as before 6 chains in parallel with 500 burn-in steps and 1,000 accepted steps, obtaining convergent chains with | R − 1| < 0.002 using around one CPU-hour.Also in this case the agreement between the CAMB and the Capse.jlchains is excellent, with differences on the posteriors smaller than 0.1 σ, as shown in the lower half of Fig. 3.

SPT-3G
The final CMB likelihood considered in this study pertains to the multifrequency 2018 SPT-3G temperature and polarization dataset, as detailed in Balkenhol  et al. (2023). 11This dataset comprises measurements of the CMB T T /T E/EE power spectra across ∼ 1,500 deg 2 of the southern Sky.Specifically, the T T power spectra span the angular multipole range from 750 < ℓ ≤ 3, 000,while the T E and EE spectra extend from 300 < ℓ ≤ 3, 000.Distinct from the PlancklLite.jland ACTPolLite.jllikelihoods, which are premarginalized over astrophysical emissions, the SPT-3G one employs a parametric model for foreground characterization in both temperature and polarization.Additionally, the SPT-3G likelihood accounts for the effects of instrumental calibration and beam uncertainties, aberration due to the relative motion with respect to the CMB rest frame, and super-sample lensing (Dutcher et al. 2021).Owing to its large number of foreground and nuisance parameters, the SPT-3G likelihood serves as an exemplary testbed for gradient-based sampling techniques.More precisely, in addition to the 6 ΛCDM parameters, we sample over 6 calibration parameters, 26 foreground parameters, and a nuisance parameter (super-sample lensing convergence), for a total of 39 parameters.A thorough description of the data model can be found in Balkenhol et al. (2023).A pure Julia implementation of the SPT-3G likelihood, SPTLikelihoods.jl, is made available in this case too. 12or the CAMB+Cobaya chains we used the same hardware settings from before.However, since we used the high-precision settings for CAMB, the computational resources employed are higher than before, being around 1,600 CPU-hours required to reach convergence.We also run 6 chains, with 500 burn-in steps and 1,000 accepted steps, using Capse.jl+Turing.jl: the obtained chains have a convergence with | R − 1| < 0.002 using 14 CPU hours.Also with this more challenging posterior there is an excellent agreement between the standard and the emulator chains, with differences on the posterior smaller than 0.1 σ, as shown in Fig. 4.

Samplers comparison
After validating our emulators, we now focus on the different inference algorithms and study their impact on the speed, accuracy, and robustness of the analyses.
The compatibility of the Capse.jlemulating framework with AD methods makes it extremely synergistic with a plethora of gradient-based samplers.Here we compare the performance of three gradient-based samplers (see Sec. 4 for details).First, we compare the state-of-the-art NUTS sampler used to obtain our fiducial results in Sec.5.1 with the recently developed MCHMC sampler.Second, we compare NUTS against a VI method, Pathfinder13 .In all cases we use the same likelihoods described in Sec.5.1 to draw the comparisons.This is possible thanks to the modularity of the Julia ecosystem, which allows us to write a likelihood in Turing.jland then sample it using different libraries and algorithms.

NUTS vs MCHMC
Before comparing the performance of the two samplers, it is essential to verify that NUTS and MCHMC return the same posterior distributions.As MCMC methods, both NUTS and MCHMC should asymptotically converge to the same target distribution.The comparison of the NUTS and MCHMC posteriors for Planck , ACT, and SPT analyses can be found in Figs. 2, respectively.The corresponding numerical values for the quoted constraints can be found in Tab. 3. In this table, we observe that the constraints obtained from both NUTS and MCHMC are within a 0.1σ distance of their Metropolis-Hastings (MH) counterpart.The excellent agreement between the two samplers indicates that differences are well within the sampling noise.
Having established that both samplers converge to the same posterior, the pivotal question now is how quickly they achieve convergence.To address this, we assess the effective sample size (ESS) per second, as it provides insight into the convergence rate of MCMC chains (Brooks et al. 2011).The auto-correlation among samples in an MCMC chain introduces uncertainty in estimating the mean and standard deviation of the posterior distributions.A higher number of effective samples per second indicates quicker convergence.
We compute the ESS of all chains using the Python library GetDist.When analyzing Planck , we find an ES-S/s of 1.6 for the NUTS sampler, while MCHMC returns an ESS/s of 3.7, as can be seen in Tab. 4. In the case of ACT, NUTS achieves 1.9 ESS/s while MCHMC reaches 4.1.Thus in both analyses we see that MCHMC is twice as fast as NUTS, as expected on the findings of Robnik et al. (2022).The difference is even bigger for the SPT analysis, since NUTS and MCHMC delivers an ESS/s of, respectively, 0.09 and 0.33514 .Overall, our findings show that the MCHMC sampler has an higher efficiency compared to the NUTS for the applications considered, as in (Bayer et al. 2023).15

NUTS vs Pathfinder
Comparing NUTS to Pathfinder is a more difficult task since Pathfinder is a VI method while NUTS is an MCMC method.VI methods do not guarantee

−
Table 3: Mean and 95% limits of the 1D marginal posteriors of the cosmological parameters obtained from the Planck , ACT, and SPT analyses, respectively.Different columns show the constraints obtained performing the same analyses using the sampling algorithms MH, NUTS, MCHMC and Pathfinder.The mean and limits have been computed using GetDist.Given the highly unstable results obtained when applying Pathfinder to the SPT posterior, we did not quote any number for this specific application.Table 4: Shows the ESS per second and computation time achieved by the Cobaya, NUTS, MCHMC and Pathfinder inference algorithms when performing the Planck (2nd and 3rd columns), ACT (4th and 5th columns), and SPT (6th and 7th column) analyses.Note that Cobaya used the cosmological code CAMB while the other samplers were paired with the Capse.jlemulator.Given the highly unstable results obtained when applying Pathfinder to the SPT posterior, we did not quote any number for this specific application.
asymptotic convergence to the target distribution, unlike MCMC methods.Instead, VI approximates the posterior distribution as a combination of possible distributions (see Sec. 4 for details).Moreover, the concept of ESS is not applicable to VI methods since they do not inherently produce MCMC chains, although MCMC-like samples can be generated by sampling the approximated posterior.Despite these caveats, we observe excellent agreement between the posteriors obtained with Pathfinder and NUTS for all the likelihoods considered.This can be observed in the top half of Figs. 2 and 3.In these plots it is possible to see that Pathfinder reproduces not only the mean and standard deviation of the marginal posteriors but also the different correlations between the parameters.Looking at Tab. 3 we can see that the Pathfinder constraints are once again within 0.2 σ of the CAMB+MH constraints. 1616 These results have been generated using the ensemble version To assess the reliability of Pathfinder, we computed 1,000 realizations of it and found that the mean obtained for each parameter follows a distribution which is close to a Gaussian with a standard deviation corresponding to less than 0.1, σ, measured on each parameter when performing a full MCMC analysis.
Regarding the SPT analysis, we found Pathfinder to provide unstable results when applied to this likelihood.For this reason, we decided not to show any result related to this combination of likelihood and sampler.
Regarding performance, since the ESS is not welldefined for VI methods, we compare the speed of the two methods by examining pure wall times.As shown in Tab. 4, Pathfinder successfully approximates the posteriors of the Planck and ACT analyses in around 10 seconds.This represents a 5 to 6 orders of magnitude im-provement compared to the Cobaya analyses using CAMB and a 2 orders of magnitude improvement compared to using Capse.jlwith either NUTS or MCHMC.However, we emphasize that the robustness of variational methods can significantly decrease when the posterior is everything but a multivariate normal distribution.However, even in these cases, methods such as Pathfinder still can prove useful as they can be used to initialize standard MCMCs, reducing the burn-in phase even when considering challenging posteriors with funnel-like geometries (Zhang et al. 2022).This is actually what we did for the SPT analysis: the Pathfinder draws were unstable and likely to be wrong, as detected by the PSIS.Nevertheless, they were way closer to the typical set than random draws from the prior and hence used them to initialize our NUTS and MCHMC chains.

MOPED compression & Chebyshev decomposition
All the MCMC results shown in the previous section are based on the standard emulator and likelihood.Although the favourable comparison with CAMB+Cobaya approach, there is still room for some computational improvements.In this section we show results based on PlanckLite.jl and on a MOPED compressed likelihood and the Chebyshev-based emulators.
We begin by discussing the standard likelihood results.As discussed in Sec.2.1, we exploit the Chebyshev decomposition to achieve peak performance.The PlanckLite.jl likelihood is composed of a sequence of linear operators that remain independent of cosmological parameters.Utilizing this property, we can apply these operators to the Chebyshev basis functions and precompute and store the results.During each step of the MCMC, we only need to multiply the stored quantities by the expansion coefficients, which solely contain the cosmological dependence.This approach eliminates the need to repeatedly apply the linear operator, resulting in enhanced computational efficiency.
Traditionally, optimizing the computation of the likelihood is not a priority since the bottleneck is the computation of the theory prediction.However, this is no longer the case when emulators are employed.For Capse.jl, the C ℓ 's computation is around 7 times faster than the computations required to compute the log-likelihood such as the Cℓ's binning.This highlights the performance of our emulators, since the bottleneck is now represented by the likelihood itself.When using NUTS and MCHMC we find an ESS/s of 13 and 32 respectively.Meanwhile, the computation time when using Pathfinder is reduced to 1.5 seconds.This shows an almost uniform performance improvement by a factor of 8 across the different inference methods considered.
Finally, we employ a MOPED likelihood (Prince & Dunkley 2019), a data compression technique which reduces the dimensions of the data vector to one number per parameter of interest, requiring an even lower amount of computational resources.We find an exquisite agreement between the PlanckLite.jl and the MOPED posteriors, with differences on marginalized parameters smaller than 0.1 σ as shown in Fig. C1, but with an improved computational performance: NUTS and MCHMC reach an ESS/s of, respectively, 60 and 150, while Pathfinder runs in 0.5 seconds.Compared to the performance obtained when using the standard emulator and PlanckLite.jlwe re- port an improvement of a factor of 40-50.This brings down the wall time for the Planck analysis to a record one second.The finding of this section are summarized in Table 5.

COMPARISON WITH OTHER EMULATORS
In this section we compare the emulators presented in this work to two closely related alternatives in the literature, ClassNet (Albers et al. 2019) and the recently developed CosmoPower17 (Spurio Mancini et al. 2022;Bolliet et al. 2023).ClassNet emulates the source functions required to compute the CMB angular power spectra using Python-based neural network emulators.The result of the emulation is then passed to CLASS, which computes the line-of-sight integrals.The main advantage of this approach is that ClassNet does not need to include A s and n s in the emulation space, since the primordial power spectrum is analytically computed and integrated against the emulated source functions.CosmoPower is a TensorFlow based emulator that follows a design philosophy closer to Capse.jl, as it directly emulates the CMB angular power spectra.Initially, it focused on the Planck analysis but it was later extended to ℓ max = 10, 000 and the same high-accuracy settings employed here (Bolliet et al. 2023).
Comparing emulators is challenging due to their distinct architectures and optimization goals.Here we highlight the differences between ClassNet, CosmoPower and Capse.jl,explaining the pros and cons of each framework.
We decided not to perform a comparison with other frameworks, such as those presented in Nygaard et al. (2023); Günther (2023); To et al. (2023) as they have a different goal: performing an analysis using the lowest possible number of Boltzmann code evaluations using active learning schemes.
In order to compare Capse.jl,ClassNet, and CosmoPower we will use several metrics.First, we will compare the computational performance of each emulator.Second, we will compare the training efficiency.Third, we will look at the flexibility of the emulator, i.e. how often would it need to be retrained based on the changes to the analysis that one wishes to perform.Finally, we will also study the efficiency of the sampling algorithms compatible with each emulator.
We report the speedup obtained by the three emulation frameworks when compared to a traditional Boltz-mann solver in the first row of Tab. 6.In this table, we can see that ClassNet, CosmoPower, and Capse.jlemulators report a speedup of a factor of 3,18 1,000, and 1,000,000 respectively.We would like to emphasize that the numbers shown in the comparison were taken directly from the ClassNet and CosmoPower release papers; their performance might improve with an hyperparameter optimization.
Comparing the training efficiency of the three emulators is not straightforward, primarily due to two main differences between them.
Firstly, the number of inputs (i.e. the cosmological parameters) and their ranges vary from emulator to emulator.While CosmoPower and Capse.jl were trained on the same number of input features, ClassNet requires two fewer inputs, significantly reducing the volume of training data needed to cover its domain.Secondly, the tessellation method to cover the parameter space can also differ.CosmoPower and Capse.jluse a Latin hypercube to span their domains, while ClassNet was trained on a hyperellipsoid centered around the bestfit Planck TT,TE,EE+lowE+lensing+BAO ΛCDM cosmology (Aghanim et al. 2020b), spanning 6 σ's.Regarding the number of learned features, ClassNet learns source functions on a k-grid with O(700) elements, chosen to accurately compute the Planck C ℓ 's.In contrast, Capse.jlcomputes all C ℓ 's up to ℓ max = 5, 000 while CosmoPower reaches an ℓ max of 11,000, making them both suitable for next-generation CMB experiments.
Considering the aforementioned caveats, training Capse.jlrequires 10 4 training points, which is of the same order of magnitude as ClassNet, and an order of magnitude less than CosmoPower Bolliet et al. (2023).In terms of network architecture, CosmoPower employs a one-size-fits-all approach with 4 fully connected hidden layers, each containing 512 neurons.ClassNet, on the other hand, tailors its networks for each emulator, utilizing different architectures with 2-4 hidden layers and 100-500 neurons.Capse.jlemploys a consistent architecture across all networks, consisting of 5 hidden layers with 64 neurons.In terms of training time, Capse.jltakes slightly less than one hour using a CPU, while the training of CosmoPower lasted one hour using a GPU.After training, Capse.jl is slightly more precise than CosmoPower for the ϕϕ coefficients, while CosmoPower is 2-3 times more precise for the T T , T E, EE coefficients.
Regarding flexibility, ClassNet is the most versatile of the emulators.As previously described, it only emulates the transfer functions and computes the line-of-sight integrals using CLASS.This allows the user to change the computation of the primordial power spectrum or the method to compute the line-of-sight integrals without having to retrain the emulator, as it would be the case for Capse.jl and CosmoPower.
Finally, we studied the efficiency of sampling algorithms compatible with each emulator.The wall time users will experience depends on both the speed of the emulator and the sampling algorithms available for Bayesian inference.
ClassNet employs traditional gradient-free sampling methods while Capse.jland CosmoPower are compatible with stateof-the-art gradient-based algorithms.
The original CosmoPower was written in TensorFlow and, coupled to TensorFlowProbability 19 , has in principle access to a rich variety of gradient-based samplers.Moreover, CosmoPower was recently re-written in JAX (Piras & Spurio Mancini 2023) and embedded in the JAX-COSMO framework, explicitly designed for gradient-based sampling methods.On the other hand, using these methods with ClassNet would be technically challenging due to its mixed C and Python implementation.Capse.jl was coupled with the Turing.jlPPL, enabling the use of NUTS, MCHMC, and Pathfinder inference algorithms, fully leveraging the differentiability of our emulators.
When comparing the wall time required to compute converging chains, ClassNet reports a speedup by a factor of 3. In Spurio Mancini et al. (2022), a Planck analysis was performed in O(10 s) using a GPU; however, as can be seen on the official notebook prepared by the authors, the chains were initialized on the typical set, using the knowledge coming from previous analyses.Capse.jl is able to perform a Planck analysis in around a minute with MCMC methods or O(1 s) with variational methods.We emphasize that Capse.jl'sruns were performed randomly initializing our algorithms from the prior.While there is nothing conceptually wrong in initializing the chains on the typical set, we do think that starting using random draws from the prior is more representative of what happens when analyzing a new dataset for the first time.
Finally, we want to emphasize that the results shown in this comparison are a direct consequence of the design choices of the emulators.ClassNet is the most flexible framework, but this comes at the expense of a lower speedup.CosmoPower delivers a higher speedup than ClassNet and has already been trained to analyze nextgeneration CMB surveys, but its training requires an expensive NN architecture and training dataset.Capse.jl is faster than CosmoPower, can be used to analyze actual datasets and is cheaper to train, but its multipole range and accuracy are slightly lower than the one of CosmoPower.According to the scenario considered, one of the three emulators considered will better suit the requirements of the analysis.We postpone to a further publication a detailed investigation of the origin of the higher runtime performance of Capse.jl.

CONCLUSIONS
In this paper, we introduced Capse.jl, a novel emulator framework for CMB angular power spectra.We discussed the preprocessing steps and the characteristics of the NN library employed, highlighting the distinctions from similar emulators in the literature (Spurio Mancini et al. 2022;Günther et al. 2022).We additionally developed an emulator based on Chebyshev polynomial decomposition, which significantly enhanced the computational performance of the original emulator by reducing the number of output features.Leveraging the Chebyshev linear decomposition further improved the efficiency of the log-likelihood evaluation.
To validate the accuracy of our emulators, we conducted two tests.First, we compared Capse.jl and CAMB  CosmoPower, here we refer to the emulator presented in Bolliet et al. (2023).
computations over 20,000 combinations of input parameters across the entire emulation range.Our emulators exhibited errors below 0.1 σ for all scales relevant to future CMB surveys.Second, we analyzed the Planck and ACT DR4 nuisance-marginalized datasets as well as the full 2018 SPT-3G dataset, comparing the posteriors with those obtained using CAMB and Cobaya.The derived contours showed a remarkable agreement, better than 0.1 σ.
Next, we examined the performance impact of different sampling algorithms.For this purpose, we considered several gradient-based samplers, namely NUTS, the state-of-the-art Hamiltonian Monte Carlo sampler, Mi-croCanonical Hamiltonian Monte Carlo, a recently developed sampler by Robnik et al. (2022) and Robnik & Seljak (2023), and Pathfinder, a state-of-the-art variational inference algorithm.While CAMB paired with Cobaya required around 100, 480, and 1,600 CPU hours to analyze Planck , ACT and SPT-3G data respectively, Capse.jlcombined with the NUTS algorithm required only 0.5 CPU hours for the Planck analysis, one CPU hour for the ACT analysis, and 14 hours for the SPT-3G analysis.Thus our analyses were at least two orders of magnitude more efficient.When comparing NUTS to MCHMC, we found the latter to be more efficient by a factor of 2-3, while producing virtually identical posteriors.Pathfinder was able to perform the Planck and ACT analysis in 1-10 s, with a very good precision.However, this approach is less robust when compared to MCMC methods as it lacks asymptotic convergence guarantees.
In this work we also tested a data reduction technique, the Chebyshev polynomial decomposition.This technique was instrumental in improving Capse.jlcomputational performance, as it improved the Planck analysis efficiency by almost an order of magnitude.
While our emulators and methods exhibits high precision and computational performance, there remain opportunities for future improvements.First, we aim to enhance the preprocessing procedure, addressing the reduced precision of Capse.jl for low-ℓ EE 2-point function spectra by employing a proper τ rescaling.Implementing a weighted mean square error loss function (similar to Nygaard et al. (2023)) based on the CMB power spectra variance could also lead to a more homogeneous emulation error at no additional computational cost.We also plan to work on refining the Chebyshev polynomial decomposition approach.Although it demonstrated precision and efficiency within the Planck multipole range, we intend to explore ways to improve its robustness when considering higher ℓ max values.In particular, some preliminary tests show that we can reach the same ℓ max of CosmoPower, while maintaining the computational efficiency of our framework and even improving the precision of our emulators with respect to what we showed in this paper.
Working on the Chebyshev polynomial decomposition approach might not only be relevant for CMB analyses but also for spectroscopic galaxy clustering analyses.The estimators used to extract cosmological information from galaxy surveys (Feldman et al. 1994;Yamamoto et al. 2006) suffer from an imprint of the survey window geometry.In order to take into account this effect, the community has developed several methods tailored for power spectrum (Blake et al. 2013;Beutler et al. 2017) estimation.However, only recently this issue has been addressed for the bispectrum (Pardede et al. 2022;Alkhanishvili et al. 2023).We plan to investigate whether the Chebyshev polynomial decomposition can be useful for this kind of application.CMB-SO (ΛCDM) . : Distribution of the relative difference between Capse.jl and the high-accuracy predictions from CAMB across the four 2-point statistics emulated in this work, normalized to the forecasted SO statistical uncertainties (f sky = 0.4).The differences are in units of cosmic variance below ℓ < 30.The shaded regions denote the 68%, 95%, and 99% percentiles of the distributions, respectively.

Figure 2 .
Figure 2. : Triangle plot, showing the standard Planck lite chains and the one obtained using Capse.jl in combination with Turing.jl.The lower triangular part of the plot focuses on the comparison between CAMB and Capse.jl, in order to ensure the precision of our emulators and likelihoods.The upper half of the plot shows only Capse.jl contours, with a comparison of the results between the samplers employed in our analysis, namely NUTS, MCHMC, and Pathfinder.

Table 2 :
Priors on the cosmological and calibration parameters varied in the analyses of Planck and ACT datasets.U (a, b) denotes a uniform distribution between [a, b], while N (µ, σ) indicates a Gaussian distribution with mean µ and variance σ 2 .

Table 5 :
Shows the ESS per second and computation time achieved by the NUTS, MCHMC and Pathfinder inference algorithms when performing the Planck Analysis.Note that here we are using the Chebyshev based emulator.

Table 6 :
A comparison between the emulating frameworks ClassNet, CosmoPower, and Capse.jl.Note that the architecture of ClassNet involves 7 different NN with 3 to 4 layers each adding to a total 5366 neurons.Regarding Bonici, Bianchini & Ruiz-Zapatero B. SIMONS OBSERVATORY RESIDUALS DISTRIBUTION