Machine Learning the Dark Matter Halo Mass of Milky Way-Like Systems

Despite the Milky Way's proximity to us, our knowledge of its dark matter halo is fairly limited, and there is still considerable uncertainty in its halo mass. Many past techniques have been limited by assumptions such as the Galaxy being in dynamical equilibrium as well as nearby galaxies being true satellites of the Galaxy, and/or the need to find large samples of Milky Way analogs in simulations. Here, we propose a new technique based on neural networks that obtains high precision ($<0.14$ dex mass uncertainty) without assuming halo dynamical equilibrium or that neighboring galaxies are all satellites, and which can use information from a wide variety of simulated halos (even those dissimilar to the Milky Way) to improve its performance. This method uses only observable information including satellite orbits, distances to nearby larger halos, and the maximum circular velocity of the largest satellite galaxy. In this paper, we demonstrate a proof-of-concept method on simulated dark matter halos; in future papers in this series, we will apply neural networks to estimate the masses of the Milky Way's and M31's dark matter halos, and we will train variations of these networks to estimate other halo properties including concentration, assembly history, and spin axis.


INTRODUCTION
In the current ΛCDM paradigm, dark matter is the dominant type of matter. For example, we expect that the Milky Way is surrounded by a dark matter halo that makes up most of its total mass. Because dark matter is not visible, it has been difficult to directly measure this mass around the Milky Way (MW), and hence there have been many studies that have attempted to estimate the Milky Way's dark matter content via other means (e.g., Oort 1926; Morrison et al. 2000; Yanny et al. 2000; Battaglia et al. 2005a; Frinchaboy & Majewski 2008; Li & White 2008a; Busha et al. 2011a; van der Marel et al. 2012a; King et al. 2015; Lowing et al. 2015; Patel et al. 2017; McMillan 2017a; Patel et al. 2018b).
Recently, Wang et al. (2020) reviewed the most common techniques that have been used to measure the Milky Way's halo mass, which we summarize here:
1. Estimating the Galactic escape velocity using high-velocity objects: High-velocity stars do not remain in the Milky Way's potential well for a long time, and therefore the velocity distribution of MW stars rapidly decreases above the escape velocity. Since the escape velocity is related to the halo mass profile, it is then possible to estimate halo mass from the measured stellar velocity distribution (e.g., Smith et al. 2007; Piffl et al. 2014; Williams et al. 2017; Monari et al. 2018; Deason et al. 2019; Grand et al. 2019).
2. Measuring the rotation curve: Circular velocities can be measured for gas in the interstellar medium (ISM) as well as maser sources and disk stars. In dynamical equilibrium, these are related to the enclosed mass via $M_{\rm enc} \propto v^2 r / G$, with the constant of proportionality dependent on the assumed asphericity of the mass distribution (e.g., Klypin et al. 2002; McMillan 2011; Pawlowski et al. 2012; Irrgang et al. 2013; McMillan 2017b; Nesti & Salucci 2013; Cautun et al. 2020).
* E-mail: ehayati@arizona.edu † LSSTC DSFP Fellow ‡ Hubble Fellow

3. Modeling tracers (halo stars, globular clusters, and satellite galaxies) with the spherical Jeans equation: For regions beyond the Galactic disk, one can measure the radial velocity dispersion and velocity anisotropy of tracers and infer the enclosed mass using the Jeans equation. This method requires an assumption for the density profile, which has been determined to have a power-law form locally; this form is typically assumed valid to very large distances. The radial velocity dispersion is often measured observationally by assuming that it is the same as the line-of-sight velocity dispersion. The velocity anisotropy is determined by proper motion measurements of the tracers, which is a key uncertainty in this method since it is difficult to obtain high-quality proper motion data for tracers at large distances (e.g., Battaglia et al. 2005b; Dehnen et al. 2006; Xue et al. 2008; Watkins et al. 2010; Gnedin et al. 2010; Bhattacharjee et al. 2014; Huang et al. 2016; Ablimit & Zhao 2017; Sohn et al. 2018; Zhai et al. 2018; Fritz et al. 2020).
4. Modeling tracers (halo stars, globular clusters, and satellite galaxies) with phase-space distribution functions: Using the assumption of steady state structure as well as an assumption about the shape of the potential, one can calculate phase-space distribution functions, i.e., the observed distributions of orbital energy and angular momentum for tracers of the potential. Via forward modeling of the true observations, it is then possible to reverse this process to infer the underlying gravitational potential well and the halo mass (e.g., Zaritsky et al. 1989; Kochanek 1996; Wilkinson & Evans 1999; Sakamoto et al. 2003; Deason et al. 2012; Eadie et al. 2015, 2017; Eadie & Jurić 2019).
5. Simulating and modeling the dynamics of stellar streams: Stellar stream shapes around the Galaxy provide information about galactic evolution and the underlying gravitational potential. The path of the stream and the different orbital speeds of objects along the streams tell us about the tidal forces that the object experienced, which can then be related to the potential well shape and the halo mass (e.g., Lin et al. 1995; Law et al. 2005; Newberg et al. 2010; Gibbons et al. 2014; Kupper et al. 2015; Hendel et al. 2018; Malhan & Ibata 2019; Erkal et al. 2019).

6. Modeling the motion of the Milky Way, M31, and other distant satellites under the framework of the Local Group timing argument: Despite the expansion of the Universe, Andromeda and the Milky Way are approaching each other because of their gravitational pull. Under the assumption that the two galaxies are in a Keplerian orbit, one may infer their total mass by measuring other orbital properties including their relative velocity, their distance, and the age of the Universe (e.g., Kahn & Woltjer 1959; Zaritsky et al. 1989; Li & White 2008b; van der Marel et al. 2012b; Sohn et al. 2013; Zaritsky et al. 2020; Zhai et al. 2020; Chamberlain et al. 2023).

7. Measurements made by linking the brightest Galactic satellites to their counterparts in simulations: In this method, one uses a Bayesian framework to measure the mass of the Milky Way by selecting simulated halos (i.e., from a dark matter simulation) that have satellites that are most similar to the satellites of the Milky Way. To select the best matches, it is important to have the proper motion of the satellites, as it has been shown that specific angular momentum is often a better constraint than knowing only the position, radial velocity, or orbital energy of the satellites (e.g., Busha et al. 2011b; Cautun et al. 2014; Patel et al. 2017; Li et al. 2017; Patel et al. 2018b).
Each of the above techniques requires assumptions, which contribute to systematic uncertainties in constraining the Milky Way's dark matter halo mass. Most of the techniques above assume dynamical equilibrium for the Milky Way's halo. Dynamical equilibrium is known to be violated at small radii due to the passage of the Large Magellanic Cloud (e.g., Laporte et al. 2018; Garavito-Camargo et al. 2019) near the center of the Milky Way, and at large radii by continued accretion onto the halo (e.g., McBride et al. 2009; Behroozi et al. 2013c). Nonetheless, dynamical equilibrium techniques share the strength that observations from arbitrary numbers of tracers can be combined.
The techniques that do not assume dynamical equilibrium rely on ΛCDM simulations. While these methods can be designed to avoid systematic biases from out-of-equilibrium systems, they are limited in the amount of data they can combine: the more observational data one has, the more difficult it is to find simulated halos that match all the observational constraints simultaneously (see discussion in, e.g., Patel et al. 2018b).
Here, we use a new approach for measuring the Milky Way's dark matter halo mass. We train a neural network on simulated galaxies to learn the transformation for linking observable galaxy properties (starting with the specific angular momenta of satellites of the Milky Way) to halo masses. This method has the following benefits:
1. No dynamical equilibrium assumptions are made.
2. No assumptions about most nearby galaxies being satellites are made.
3. The approach can learn about relationships between observables (e.g., satellite orbits) and mass even from halos that do not match the MW or M31, leading to greater constraining power.
4. Arbitrary constraints from the local or larger-scale environment (e.g., distance and/or velocity offsets to the nearest larger halo) can be self-consistently included.
This paper is the first in a series that will explore the ability of neural networks to constrain the properties of the Local Group's dark matter distribution. The goal in this paper is to explore how well the technique performs before adapting it to the Milky Way or M31 and their satellite systems. In the appendix, we consider a generic error model that is independent of the sky position where satellites are detected. While beyond the scope of the current paper, neural networks in the future will also provide the advantages of:
1. Being able to use arbitrary non-dark matter tracers (e.g., gas rotation curves in hydrodynamical simulations) as input features to neural networks to achieve the most accurate mass constraints.
2. Being able to use domain adaptation techniques (e.g., Ćiprijanović et al. 2022) to identify mass-observable relationships that are independent of baryonic physics differences across hydrodynamical simulations.
3. Being able to estimate other halo properties with minimal additional effort, simply by changing the training target. Such properties could include the halo spin axis, halo concentration, and halo mass assembly history.
In this paper, we use dark matter halo simulations to train neural networks to estimate masses across a broad halo mass range ($10^{8}-10^{14}\,M_\odot$). Inputs to the neural networks are based on observables including neighboring galaxy orbits, maximum circular velocity of the largest satellite, and distances to nearby more massive halos. In this paper, we take the limit of perfect information, assuming that no observational errors exist, as well as test the impact of a fiducial observational error model. In the second paper in this series, we will convolve simulated halo and galaxy properties with realistic observational errors, re-train the network, and use observed satellite orbits from Gaia DR3 to estimate the mass of the Milky Way's and Andromeda's dark matter halos. In the third paper in this series, we will extend the analysis to predict Milky Way halo properties beyond mass, including concentration, spin axis, and assembly history.
This paper is organized as follows. In Section 2, we describe the training process and dark matter simulations; in Section 3, we illustrate the performance of the resulting neural networks; and we discuss these results and provide conclusions in Section 4. Appendix A provides results including fiducial observational errors. We assume a flat, ΛCDM universe with $\Omega_m = 0.307$, $\Omega_\Lambda = 0.693$, $n_s = 0.96$, $h = 0.68$, and $\sigma_8 = 0.823$. We adopt the virial halo mass definition $M_{\rm vir}$ from Bryan & Norman (1998), i.e., the total mass (dark + baryonic) within a radius $R_{\rm vir}$ of a density peak.

Dark Matter Simulation
For this work, we use the public Very Small MultiDark Planck (VSMDPL) simulation with $3840^3$ dark matter particles, each of mass $6.2 \times 10^{6}\,M_\odot/h$. The simulation is based on a flat, ΛCDM universe with $\Omega_m = 0.307$, $\Omega_\Lambda = 0.693$, $n_s = 0.96$, $h = 0.68$, and $\sigma_8 = 0.823$. It evolves matter from $z = 150$ to $z = 0$ within a periodic cube of side length 160 comoving Mpc/$h$. There are 151 snapshots with identified halos between $z = 0$ and $z = 25$. Halos are identified using Rockstar (Behroozi et al. 2013a), and merger trees are identified using the Consistent Trees algorithm (Behroozi et al. 2013b). Each halo is identified in the merger trees as a central halo or as a satellite halo (i.e., a halo contained within the virial radius of a larger halo). We adopt the virial halo mass definition $M_{\rm vir}$, i.e., the total mass (dark + baryonic) within a radius $R_{\rm vir}$ of a density peak, such that the average density enclosed is $\rho_{\rm vir}$ from Bryan & Norman (1998).

Intuition for Using Specific Angular Momenta
One of the principal inputs to our neural networks is the specific angular momenta of neighboring galaxies. Under our halo definition, both the halo radius and the halo circular velocity ($\sqrt{GM/R}$) scale as halo mass to the one-third power. As a result, the characteristic distances and velocities of the satellite halos with respect to the host halo (which by dimensional analysis are proportional to the halo radius and circular velocity) both scale as host halo mass to the one-third power. The characteristic specific angular momenta of satellites then depend on halo mass to the two-thirds power:
$$ j \propto R_{\rm vir} \sqrt{G M_{\rm vir} / R_{\rm vir}} \propto M_{\rm vir}^{2/3}. $$
This characteristic scaling is evident across a broad mass range for all central halos in our simulation in Fig. 1, which shows the average specific angular momenta of the 30 largest neighbors versus central halo mass.
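The scaling argument above can be checked numerically. The following is a minimal sketch, not the paper's code: the function names are hypothetical, and `RHO_VIR` is a placeholder value for the mean enclosed density (the exponents do not depend on it).

```python
import math

G = 4.30091e-6     # Newton's constant in kpc (km/s)^2 / Msun (standard value)
RHO_VIR = 1.3e4    # Msun / kpc^3; placeholder mean enclosed density (assumption)

def virial_radius(m_vir):
    """Radius (kpc) within which mass m_vir (Msun) has mean density RHO_VIR."""
    return (3.0 * m_vir / (4.0 * math.pi * RHO_VIR)) ** (1.0 / 3.0)

def circular_velocity(m_vir):
    """Circular velocity sqrt(G M / R) at the virial radius, in km/s."""
    return math.sqrt(G * m_vir / virial_radius(m_vir))

def characteristic_j(m_vir):
    """Characteristic specific angular momentum R * V, in kpc km/s."""
    return virial_radius(m_vir) * circular_velocity(m_vir)

def log_slope(func, m_lo, m_hi):
    """Logarithmic slope d log f / d log M between two masses."""
    return (math.log10(func(m_hi)) - math.log10(func(m_lo))) / (
        math.log10(m_hi) - math.log10(m_lo)
    )

slope_r = log_slope(virial_radius, 1e10, 1e14)
slope_v = log_slope(circular_velocity, 1e10, 1e14)
slope_j = log_slope(characteristic_j, 1e10, 1e14)
print(f"slopes: R {slope_r:.3f}, V {slope_v:.3f}, j {slope_j:.3f}")
```

Because the radius and the circular velocity each carry one factor of mass to the one-third power, their product carries the two-thirds power regardless of the density normalization.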
As discussed in Patel et al. (2018b), the specific angular momentum of the satellite galaxies provides strong constraints on host halo mass. As shown in Fig. 1, the $M_{\rm vir}^{2/3}$ scaling is evident for a very wide range of halo masses. Only halos above $10^{13.5}\,M_\odot$ start to show a bend in the scaling relation, due to more radial orbits for massive halos. Additionally, halos below $M_{\rm vir} = 10^{12.5}\,M_\odot$ show scatter towards high specific angular momenta, which occurs for lower-mass halos that are near much more massive halos. In this paper, we do not assume advance knowledge of which nearby galaxies are satellites and which are not. Nonetheless, satellite angular momenta are approximately conserved throughout their orbits (Patel et al. 2018b). Hence, even when bound and unbound galaxies are mixed in a given vicinity of a halo, the bound galaxies' orbits will appear as an overdensity in the specific angular momentum distribution of the neighboring galaxies, and so specific angular momenta still provide useful information about host halo mass.

Halo Selection and Input Features
To train our deep neural networks, we first select halos with peak masses (i.e., their largest historical halo mass) larger than $10^{8}\,M_\odot$ from the VSMDPL simulation, as the simulation does not resolve lower-mass halos well. These are also the only halos expected to host galaxies for which proper motions can be measured, due to the atomic cooling limit suppressing star formation in lower-mass halos (e.g., O'Shea et al. 2015). In contrast to past studies, we place no additional prior or selection on host halo masses, as this information comes from observables alone in our method.
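To give a sense of the resolution limit behind this mass cut, the particle mass quoted for VSMDPL can be converted into an approximate particle count per halo. This is an illustrative sketch only; it assumes the halo mass threshold is quoted in $M_\odot$ (without a factor of $h$), and the function name is hypothetical.

```python
H = 0.68                  # dimensionless Hubble parameter quoted in the text
PARTICLE_MASS_H = 6.2e6   # VSMDPL particle mass in Msun/h, quoted in the text

# Convert the particle mass from Msun/h to Msun.
particle_mass = PARTICLE_MASS_H / H

def n_particles(m_halo_msun):
    """Approximate number of simulation particles in a halo of the given mass."""
    return m_halo_msun / particle_mass

n_min = n_particles(1e8)    # at the peak-mass selection threshold
n_mw = n_particles(1e12)    # at a Milky Way-like mass
print(f"{n_min:.0f} particles at 1e8 Msun, {n_mw:.0f} at 1e12 Msun")
```

Under this assumption, halos at the threshold contain only about ten particles, which is consistent with the statement that lower-mass halos are not well resolved.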
Past studies to infer mass have typically assumed that all nearby galaxies are satellites of the Milky Way, which places a strong prior on host halo mass. Because we do not know this to be the case in reality, we drop this assumption in this study, instead using the orbital properties (including specific angular momentum $j$, radial distance $r$, and relative velocity $v$) of the largest neighboring halos out to a fixed distance as our main input features. For this paper, we select neighboring halos out to 200 kpc from central halos, corresponding approximately to the distance out to which proper motions can be measured for Milky Way satellite candidates with Gaia.
In particular, we do not make any cuts on whether the neighbors are bound or not, as this information is not known a priori from the observations. Past studies, including Patel et al. (2018b), used the specific angular momenta of ∼10 satellites to infer the mass of the Milky Way's halo, whereas the Gaia mission has now provided 6D phase space information (and therefore angular momenta) for ∼50 satellites (Li et al. 2021; Fritz et al. 2018; McConnachie & Venn 2020) within 200 kpc. Hence, we train a 10-neighbor neural network (3,093,208 halos) to compare our approach with past approaches, and we also train a 30-neighbor neural network (222,612 halos) to show the improvement possible with our new approach.
In tests, we found that dropping the assumption of satellite membership made it very difficult for networks that used angular momenta alone to reliably estimate host halo mass. As discussed in later sections, the neighbors of low-mass halos ($< 10^{11}\,M_\odot$) do not have specific angular momentum distributions that correlate with halo mass; because low-mass halos are much more numerous than high-mass halos, training results in networks that try to limit the worst-case performance for low-mass halos, rather than improve the best-case performance for higher-mass halos. However, adding some observable information that correlates broadly with host halo mass can help networks discriminate between the cases where the specific angular momentum of neighboring halos correlates with halo mass and where it does not.
In this work, we use the maximum circular velocity, $v_{\rm max}$, of the most massive satellite (the Large Magellanic Cloud in the case of the Milky Way, or M33 in the case of Andromeda) to help the networks distinguish between whether they are in the low-mass (neighbor angular momenta uncorrelated with host halo mass) or high-mass (neighbor angular momenta correlated with host halo mass) regimes. Using $v_{\rm max}$ of the largest satellite in this way follows from past studies that have also done so (see, e.g., Busha et al. 2011a; Patel et al. 2017, 2018b; Patel & Mandel 2023).
From Fig. 1, we know that nearby massive halos can influence the angular momentum distributions of satellites. Hence, we also include input features corresponding to the distance to the nearest larger halo ($D_{\rm larger}$) and the distance to the nearest larger halo with $M_{\rm vir} \geq 10^{14}\,M_\odot$ ($D_{14}$). At high mass, these quantities converge by definition.

Network Training
Neural networks consist of interconnected nodes organized into layers, and they are capable of learning intricate patterns and relationships from data. We have used a deep neural network (NN) for our regression task of estimating halo mass from galaxy observables. Deep NNs are commonly used for image- and language-related tasks, but they can also be applied to arbitrary structured data as in this paper.
The hyper-parameters and structure that we used in our neural networks are as follows:
1. Input Size: Our input layer has 3 features for the orbital properties ($j$ [specific angular momentum], $r$ [distance from halo center], $v$ [velocity offset from halo center]) of each neighboring halo, as well as an additional 3 features for the target halo's environment ($v_{\rm max}$ of most massive satellite, distance to nearest larger halo, and distance to nearest $10^{14}\,M_\odot$ halo). For the 10-neighbor network, this totals 33 input features, and for the 30-neighbor network, this totals 93 input features.
2. Layer Architecture: We use 5 fully-connected hidden layers. Each hidden layer (i.e., a layer in between the input and output layers) contains neurons that apply a nonlinear transformation to the input features, which are taken from the outputs of the previous layer. Fully connected layers are those in which every neuron in a given layer receives an input from every neuron in the previous layer. Initially, we have 10 neurons in the first hidden layer. Progressing through the network, we decrease the number of neurons in each subsequent layer (8, 6, 4, 2). This is known as a decreasing architecture, and it helps in reducing the complexity of the information passed through each layer as we go deeper into the network.
3. Activation Function: We have used Rectified Linear Unit (ReLU) activation functions in our hidden layers. ReLU is a common choice because it introduces nonlinearity into the model while being computationally efficient. Nonlinearity is essential in neural networks; otherwise, the action of the neural network could be represented by a linear transform (i.e., a matrix multiplication), which would prevent it from learning complex, nonlinear relationships between the input and output data.
4. Output Layer: We have a single neuron in the output layer, since our network is performing regression to predict a single output (i.e., the mass of a central halo).

5. Loss Function: We have chosen Mean Squared Error (MSE) as our loss function, i.e., the metric by which we judge the neural network's performance. MSE is commonly used for regression tasks and calculates the average of the squared differences between predicted and actual values. It penalizes larger errors more heavily.
6. Optimizer: We have chosen the Adam optimizer. Adam is an adaptive learning rate optimization algorithm that combines the benefits of two other popular optimizers, RMSprop and Momentum, in that it adaptively chooses how far to proceed along the gradient of the loss function for each update to the neural network parameters. It is well-suited for a wide range of problems and often converges faster than traditional stochastic gradient descent (SGD).
7. Learning Rate: Our learning rate is set to 0.001. This parameter controls the initial step size during optimization. The value of 0.001 is a common starting point, but its value can be tuned depending on the specific problem and data set.
8. Batch Size: Our batch size is 64. This determines the number of input data points used in each update of the neural network's weights during training. Smaller batch sizes can lead to noisier updates but are more computationally efficient, while larger batch sizes provide smoother updates but require more time to compute each update.
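The architecture listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random initialization scheme, and the seed are hypothetical, and training with Adam (learning rate 0.001, batch size 64) is left to a deep learning framework.

```python
import random

HIDDEN_SIZES = [10, 8, 6, 4, 2]   # decreasing hidden-layer widths from the text

def init_network(n_inputs, seed=0):
    """Create (weights, biases) for each layer; the initialization is an assumption."""
    rng = random.Random(seed)
    sizes = [n_inputs] + HIDDEN_SIZES + [1]   # single output neuron (halo mass)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(layers, x):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, value) for value in x]
    return x[0]

def mse(predictions, targets):
    """Mean squared error between predicted and actual values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def count_parameters(layers):
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

net10 = init_network(33)   # 10-neighbor network: 3 * 10 + 3 inputs
net30 = init_network(93)   # 30-neighbor network: 3 * 30 + 3 inputs
print(count_parameters(net10), count_parameters(net30))
```

With these layer widths the networks are small, with roughly five hundred and eleven hundred trainable parameters for the 10- and 30-neighbor cases, respectively.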
For training, we select all central halos with at least $N$ neighbors within 200 kpc (with $N = 10$ or 30, as appropriate). As above, we place no prior on central halo mass, so these halos range from $\sim 10^{8}-10^{15}\,M_\odot$. For each of the $N$ neighbors with the highest peak $v_{\rm max}$ (a proxy for the brightest galaxies; Reddick et al. 2013), we use three orbital parameters ($j$, $r$, and $v$) as inputs to the neural network. We also use the $v_{\rm max}$ of the most massive satellite (corresponding to the $v_{\rm max}$ of the Large Magellanic Cloud for the Milky Way), the distance to the nearest larger halo (corresponding to the distance to M31 for the Milky Way), and the distance to the nearest $10^{14}\,M_\odot$ or larger halo (corresponding to the Virgo Cluster for the Milky Way) as input parameters. As above, the 10-neighbor network has 33 input features, and the 30-neighbor network has 93 input features.
Fig. 2.-Input features include neighboring halos' specific angular momenta ($j$), radial distances ($r$), and relative velocities ($v$), as well as the maximum circular velocity of the most massive satellite ($v_{\rm max,sat}$), the distance to the nearest larger halo ($D_{\rm larger}$), and the distance to the nearest halo with $M_{\rm vir} > 10^{14}\,M_\odot$ ($D_{14}$). For all networks (regardless of the number of inputs), there are 5 hidden layers gradually decreasing from 10 nodes to 2 nodes, with one output layer corresponding to the predicted halo mass.
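The construction of the input vector described above can be sketched as follows. This is an illustrative sketch rather than the authors' pipeline: the function `build_features`, the dictionary keys, and all numerical values in the toy example are hypothetical placeholders. Positions and velocities are assumed to be given relative to the central halo.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in a))

def build_features(neighbors, vmax_sat, d_larger, d_14, n_neighbors, r_max=200.0):
    """Return a flat list of 3 * n_neighbors + 3 features, or None if the halo
    has fewer than n_neighbors neighbors within r_max (kpc)."""
    nearby = [nb for nb in neighbors if norm(nb["pos"]) <= r_max]
    if len(nearby) < n_neighbors:
        return None
    # Keep the neighbors with the highest peak vmax; no bound/unbound cut is applied.
    selected = sorted(nearby, key=lambda nb: nb["vmax_peak"], reverse=True)[:n_neighbors]
    features = []
    for nb in selected:
        j = norm(cross(nb["pos"], nb["vel"]))   # specific angular momentum |r x v|
        features.extend([j, norm(nb["pos"]), norm(nb["vel"])])
    features.extend([vmax_sat, d_larger, d_14])
    return features

# Toy neighbors on the x-axis moving along y (placeholder values, not real data).
toy_neighbors = [
    {"pos": (10.0 * i, 0.0, 0.0), "vel": (0.0, 100.0, 0.0), "vmax_peak": float(i)}
    for i in range(1, 26)
]
feats = build_features(toy_neighbors, 90.0, 780.0, 16500.0, n_neighbors=10)
too_few = build_features(toy_neighbors, 90.0, 780.0, 16500.0, n_neighbors=30)
print(len(feats), too_few)
```

In the toy example only 20 of the 25 neighbors lie within 200 kpc, so the 10-neighbor feature vector has 33 entries while the 30-neighbor selection rejects the halo.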
We used simulation snapshots from $z = 0$ to $z = 0.25$ from the VSMDPL simulation to increase the diversity of neighboring halo orbital configurations available for training. We found that including training data from earlier snapshots did not cause a measurable bias in median predicted masses for $z = 0$ halos, suggesting that the distribution of orbital configurations has not changed significantly over this redshift interval. Halos are split into a training sample (63%) and a test sample (37%) according to whether the halos have an X-coordinate less than or greater than 96 Mpc/$h$ (compared to an overall box length of 160 Mpc/$h$). This division is made to capture the uncertainties arising both from Poisson statistics and larger-scale cosmic variance.
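A spatial split of this kind is simple to express in code. The sketch below is illustrative only; the function name, the `"x"` key, and the uniformly spaced toy catalog are hypothetical.

```python
def split_by_x(halos, x_cut=96.0):
    """Split halos into training (x < x_cut) and test (x >= x_cut) samples.
    Coordinates are assumed to be in comoving Mpc/h."""
    train = [halo for halo in halos if halo["x"] < x_cut]
    test = [halo for halo in halos if halo["x"] >= x_cut]
    return train, test

# Toy catalog: one halo per Mpc/h along a 160 Mpc/h box (placeholder data).
toy_halos = [{"x": i + 0.5} for i in range(160)]
train_set, test_set = split_by_x(toy_halos)
print(len(train_set), len(test_set))
```

For uniformly distributed halos the cut at 96 Mpc/$h$ places 60% of the volume in the training region; the 63%/37% split quoted above refers to the actual halo counts, which need not follow the volume fraction exactly.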
To pre-process, we ordered neighboring halos by increasing specific angular momenta, took the logarithms of all input features, subtracted the mean values across all neighbors, and scaled to unit variance. We then trained two 5-layer fully connected neural networks on the 10- and 30-neighbor input feature vectors to predict the masses of the corresponding central halos. The details of the network structure are shown in Fig. 2, and the details of the hyper-parameters are shown in Table 1.
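The pre-processing steps can be sketched as below. This is one possible reading of the procedure, not the authors' code: it assumes the mean and variance are pooled over all neighbors of all halos separately for each of the three orbital quantities, and the function names and toy values are hypothetical. The three environment features would be treated analogously and are omitted for brevity.

```python
import math
import statistics

def sort_and_log(neighbors):
    """Order (j, r, v) tuples by increasing j and take log10 of every entry."""
    ordered = sorted(neighbors, key=lambda t: t[0])
    return [tuple(math.log10(x) for x in t) for t in ordered]

def fit_scaler(halos_logged):
    """Pooled mean and standard deviation of log j, log r, and log v."""
    stats = []
    for k in range(3):
        values = [nb[k] for halo in halos_logged for nb in halo]
        stats.append((statistics.fmean(values), statistics.pstdev(values)))
    return stats

def transform(halo_logged, stats):
    """Standardize one halo's neighbors and flatten into a feature list."""
    return [(nb[k] - stats[k][0]) / stats[k][1] for nb in halo_logged for k in range(3)]

# Toy data: two halos with two neighbors each (placeholder values).
raw = [
    [(100.0, 10.0, 10.0), (10.0, 1.0, 100.0)],
    [(1000.0, 100.0, 1.0), (1.0, 10.0, 10.0)],
]
logged = [sort_and_log(halo) for halo in raw]
stats = fit_scaler(logged)
scaled = [transform(halo, stats) for halo in logged]
print(scaled[0])
```

In practice the scaler statistics would be fit on the training sample only and then applied unchanged to the test sample.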
We varied several different hyper-parameters for the training process: the number of layers, the learning rate, the number of nodes per layer, the loss function, and the batch size. We used a hand search to tune the learning rate, batch size, and loss function. For the rest of the hyper-parameters, we started with a simple network and increased the size until the mean-squared error did not improve further.
We did not find any substantial improvements over the fiducial choice of parameters in Table 1, and in some cases found worse performance. For example, when using optimizers such as RMSprop or Adagrad, we observed that the network exhibited a loss of prediction accuracy, particularly at the high-mass end. This suggests that these optimizer choices may have gotten stuck in local minima, as performance for the vast majority of the halo sample (i.e., low-mass halos) was prioritized over performance for high-mass halos.

Performance of the neural network approach
We measure the performance of the neural network approach by applying the trained network to halos that it has never seen before (i.e., halos in our test set). The variance of the predicted halo masses at fixed actual halo mass then corresponds to the expected uncertainties of the network when applied to new data, such as for the Milky Way and M31. Hereafter, we quote network uncertainties at an actual halo mass of $10^{12}\,M_\odot$ to represent the expected performance for the Milky Way and M31.
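A minimal sketch of how such a performance measurement could be computed is given below. The function name, the bin edges, and the toy predictions are hypothetical; the sketch reports the median offset and the root mean square error (in dex) in bins of actual halo mass.

```python
import math
import statistics

def binned_performance(log_m_true, log_m_pred, bin_edges):
    """For each bin of actual log10 mass, return (lo, hi, median offset, RMSE).
    Bins with no halos report None for both statistics."""
    results = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        offsets = [p - t for t, p in zip(log_m_true, log_m_pred) if lo <= t < hi]
        if not offsets:
            results.append((lo, hi, None, None))
            continue
        rmse = math.sqrt(sum(d * d for d in offsets) / len(offsets))
        results.append((lo, hi, statistics.median(offsets), rmse))
    return results

# Toy predictions scattered by +/- 0.1 dex around 10^12 Msun (placeholder data).
toy_true = [12.0, 12.0, 12.0, 12.0]
toy_pred = [11.9, 12.1, 11.9, 12.1]
performance = binned_performance(toy_true, toy_pred, [11.5, 12.5, 13.5])
print(performance)
```

The first bin in the toy example has an unbiased median and an RMSE of 0.1 dex, while the second bin is empty.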
Fig. 3 summarizes the results of our work, demonstrating that the specific angular momenta of neighboring galaxies can be used to accurately infer the masses of central halos. The medians of the neural networks' predicted masses (in bins of actual halo mass) closely match actual halo masses, with typical median offsets of $\lesssim 0.03$ dex at halo masses of $10^{12}\,M_\odot$. However, the uncertainty in the predicted masses is significantly larger for low-mass halos (below a threshold of $\sim 10^{11.7}\,M_\odot$) compared to high-mass halos. The size of the uncertainty is primarily influenced by whether the neighboring galaxies within 200 kpc are satellites or not. We investigate this aspect further in the next subsection.
The bottom plots in Fig. 3 show the RMS magnitudes of the errors across the full range of predicted masses. Specifically for MW-mass halos (again considering a threshold of $M_{\rm vir} \gtrsim 10^{11.7}\,M_\odot$), the typical errors are $\sim 0.17$ dex when using 10 neighboring halos, and they are $\sim 0.12$ dex when using 30 neighboring halos, corresponding to a 30% reduction in uncertainty. The ratio of these errors is smaller than expected from Poisson statistics ($0.17/0.12 \sim 1.4 < \sqrt{30/10} \sim 1.7$), which may be caused by correlated orbits known to occur in ΛCDM simulations, such as satellites coming in along the same filaments or even some satellites being satellites of other satellites (e.g., Patel et al. 2020; Erkal et al. 2020; Battaglia et al. 2022).
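The arithmetic behind this comparison can be reproduced directly from the quoted errors. The variable names below are illustrative; the only inputs are the two error values and neighbor counts given in the text.

```python
import math

err_10 = 0.17   # dex, typical error with 10 neighbors (from the text)
err_30 = 0.12   # dex, typical error with 30 neighbors (from the text)

observed_ratio = err_10 / err_30              # measured improvement factor
poisson_ratio = math.sqrt(30 / 10)            # expected if errors scaled as 1/sqrt(N)
fractional_reduction = 1.0 - err_30 / err_10  # fractional drop in uncertainty

print(f"observed ratio: {observed_ratio:.2f}")
print(f"Poisson expectation: {poisson_ratio:.2f}")
print(f"reduction in uncertainty: {100 * fractional_reduction:.0f}%")
```

The observed ratio falls short of the Poisson expectation, which is the basis for the suggestion that neighbor orbits are not statistically independent.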

Understanding what information constrains halo masses
To analyze the relationships between satellite specific angular momenta ($j$), relative velocities ($v$), and radial distances ($r$) with respect to halo mass, we present Figures 4 and 5. These figures illustrate the distributions of neighboring halos' orbital properties, where the left-hand panels are color-coded by the most massive satellite's maximum circular velocity ($v_{\rm max,sat}$), and the right-hand panels are color-coded by $D_{14}$, the distance to the nearest massive halo ($M_{\rm vir} > 10^{14}\,M_\odot$).
The overall distributions of $j$, $v$, and $r$ exhibit distinct patterns, particularly with larger spreads observed for low-mass halos compared to high-mass halos. This can be attributed to the neighbors of high-mass halos being predominantly satellites of the high-mass halo, so the high-mass halo has a strong influence on its neighbors' orbits. However, neighbors of low-mass halos are typically not satellites and hence the presence of the low-mass halo does not strongly influence their orbits. Therefore, the distributions of neighbors' $j$ and $v$ are much more correlated with halo mass for high-mass than low-mass halos.
The color coding in the left-hand plots shows a smooth progression with actual halo mass, demonstrating a strong correlation between halo mass and the maximum circular velocity of the most massive satellite ($v_{\rm max,sat}$). Hence, the neural networks can effectively utilize satellite orbit information for large halo masses (when orbit information correlates with halo mass), while relying on the most massive satellite's maximum circular velocity as the best estimate when neighboring objects are not satellites.
Table 1.-The fiducial hyper-parameters used to train the network (in bold), as well as variations explored.
Fig. 3.-Predicted halo mass versus actual halo mass for the neural networks in this paper applied to dark matter simulations. Input features to the networks correspond to observables, primarily including neighboring galaxies' specific angular momenta and other orbital properties. The left figure shows the result from halos with at least 30 neighboring galaxies, with reduced errors compared to the right figure, which used halos having at least 10 neighboring galaxies. In each figure, the bottom panels show the root mean square error (RMSE) as a function of actual halo mass. Typical errors are small in both cases, about 0.17 dex for Milky Way-mass halos for the network using 10 neighbors and 0.12 dex for Milky Way-mass halos for the network using 30 neighbors. Error bars show the standard deviations of the predicted halo masses as a function of actual halo mass. The black line shows medians of predicted halo masses in bins of actual halo mass. The red line serves as a reference to indicate where the predicted mass would be equal to the actual mass.
The right-hand plots reveal that the presence of massive nearby halos biases neighbors' orbits, especially for low-mass halos. This outcome is expected since tidal forces from high-mass halos exert influence on the orbits of all neighboring halos, resulting in increased relative velocities between the low-mass halo and its neighbors. Additionally, massive halos have high satellite velocities, and because the orbits of satellite halos often extend beyond halos' virial radii (where they become known as "backsplash" or "flyby" halos; see, e.g., Diemer 2021; O'Donnell et al. 2021), some neighboring halos around low-mass halos will have orbits that are strongly influenced by their high-mass neighbors.
Figures 6 and 7 show the relationship between three variables: distance to the nearest $M_{\rm vir} > 10^{14}\,M_\odot$ halo ($D_{14}$), distance to the nearest larger halo ($D_{\rm larger}$), and $v_{\rm max}$ of the most massive satellite ($v_{\rm max,sat}$), with respect to halo mass.
The parameter $v_{\rm max,sat}$ has a very strong correlation with host halo mass, as larger halos typically host larger satellites.
The parameter $D_{\rm larger}$ also exhibits a correlation with halo mass. However, the relationship is weaker and exhibits a different shape from that of $v_{\rm max,sat}$. Larger halos are relatively less common, which directly implies that the distances between large halos tend to be larger than the distances between small halos, despite the fact that larger halos are more biased relative to the underlying dark matter distribution. We also note that there is a kink in the median relation between $D_{\rm larger}$ and halo mass at $M_{\rm vir} \sim 10^{11}\,M_\odot$, which occurs because we are selecting halos with at least 30 neighbors within a 200 kpc radius; for low-mass halos, this preferentially selects halos in dense environments, i.e., for which the distance to surrounding halos is significantly decreased.
Finally, unlike $v_\mathrm{max,sat}$ and $d_\mathrm{larger}$, $d_{14}$ does not exhibit much correlation with halo mass. This indicates that halos of varying masses are present across different environments, leading to a wide range of $d_{14}$ values irrespective of halo mass.
To confirm our interpretation that neighbors of low-mass halos are not providing any information about host halo masses, we trained a network with just three parameters ($d_{14}$, $d_\mathrm{larger}$, and $v_\mathrm{max,sat}$), and found similar errors for low-mass halos as compared to the network provided with full information about satellites (Figure 8). At the same time, the errors from this network (> 0.27 dex) imply that, for halos with $M_\mathrm{vir} > 10^{11.7} M_\odot$, adding orbital information for neighboring halos reduces the variance in predicted masses by > 90%. Hence, although $v_\mathrm{max,sat}$ is helpful to establish a broad prior on host halo mass, most of the information leading to the final predicted mass for MW-mass and larger halos is coming from neighboring halos' orbits.
Since we have shown that nearby massive halos impact neighboring halos' orbital distributions, we also consider a network trained on isolated halos (Fig. 9). Since the Milky Way and M31 are ∼ 11 Mpc/$h$ from the Virgo Cluster (e.g., Mei et al. 2007), we trained a separate network using only halos with $d_{14} > 10$ Mpc/$h$. This network performed only marginally better (0.113 dex vs. 0.118 dex errors for $10^{12} M_\odot$ halos) than the network with no selection on $d_{14}$, suggesting that the network with no selection is nonetheless able to compensate well for the presence of a larger nearby halo.

Fig. 6.-Left: There is a strong correlation between the maximum circular velocity of the most massive satellite and the host halo mass. The color coding indicates the distance to the nearest larger halo, which is also correlated with host halo mass, but more weakly than the maximum circular velocity of the most massive satellite. Right: This plot shows the correlation between the distance to the nearest larger halo and the host halo mass. Larger halos are less prevalent, which leads to larger distances between them when compared to smaller halos. So, the distribution of larger halos contributes to a distinct pattern for $d_\mathrm{larger}$, different from that of $v_\mathrm{max,sat}$. There is a noticeable kink in the median relation between $d_\mathrm{larger}$ and halo mass around $M_\mathrm{vir} \sim 10^{11} M_\odot$. This kink arises due to our selection criteria, where we focus on halos with a minimum of 30 neighbors within a 200 kpc radius.

Fig. 7.-There is little correlation between the distance to the nearest massive halo and the host halo mass. Halos of all masses can be found near massive halos, which in turn can significantly impact orbital properties of their neighboring halos.

DISCUSSION AND CONCLUSIONS
We find that applying a neural network with information from neighboring halo orbits can place tight constraints on the masses of Milky Way-like halos, with typical errors less than 0.12 dex. In our analysis, using information from 30 neighboring galaxies yields more accurate predictions of central halo masses compared to using only 10 neighboring galaxies, for which the uncertainties rise to ∼ 0.17 dex. This finding is consistent with the result reported by Patel et al. (2018b), in that incorporating specific angular momenta as input variables allows for tight constraints in predicting central halo masses.
Our approach offers several advantages over previous methods, addressing certain limitations and paving the way for future advancements. First, we have shown that it is not necessary to assume dynamical equilibrium or to assume satellite status to achieve tight constraints on halo masses, at least for halos with enough nearby satellites. Second, past simulation-based methods, such as those employed in Patel et al. (2018b) and others' previous works, may have slightly underestimated errors due to correlations between satellite orbits, regardless of whether the measurement errors are included or not. In our case, we find that going from 10 satellites to 30 satellites gives a factor of $\sqrt{2}$ improvement in uncertainties, whereas Poisson statistics would suggest a factor of $\sqrt{3}$. Part of the barrier in achieving lower (Poisson-limited) uncertainties could be due to correlations between satellite orbits, such as satellites arriving along the same filament. However, part of the barrier could also be limitations in characterizing the environment. For example, we showed that nearby high-mass halos cause contamination in satellite orbits, but other aspects of the environment could correlate with satellite orbits in as yet unexplored or unknown ways.

This study did not investigate the impact of observational errors beyond the fiducial observational error model in Appendix A, in part because we wished to understand the maximal amount of information present in satellite orbits. For a study that is applicable to the Milky Way and/or M31 systems, one would need to account for observational errors that correlate with heliocentric distance and other factors. This is the next planned step in our paper series, which will involve training a neural network on simulations with realistic observational errors and then using the resulting network to measure the masses of the Milky Way and Andromeda. Furthermore, our current work, similar to many previous studies, did not extensively test the method on hydrodynamical simulations. We recognize the importance of investigating the effectiveness of our approach on non-dark matter-only simulations, and we also plan to perform such tests. In particular, we plan to cross-validate the method by training on one hydrodynamical simulation and testing on another hydrodynamical simulation with a different physics implementation.

Fig. 8.-This figure shows a neural network trained on just three features: the distance to the nearest halo with $M_\mathrm{vir} > 10^{14} M_\odot$ ($d_{14}$), the distance to the nearest larger halo ($d_\mathrm{larger}$), and the maximum circular velocity of the most massive satellite ($v_\mathrm{max,sat}$). The uncertainties are now more similar across halo masses, with typical values of 0.271 dex at halo masses of $10^{12} M_\odot$. This suggests that, while helpful, $v_\mathrm{max,sat}$ does not primarily determine halo mass, but instead most of the information leading to lower uncertainties in Fig. 3 is coming from the orbits of neighboring halos.
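The comparison made in the discussion above, between the measured gain from 10 to 30 satellites and the Poisson expectation, can be checked with a few lines of arithmetic. The sketch below uses the rounded uncertainties quoted in the text (0.17 dex and 0.12 dex); the "effective tracer ratio" and the power-law slope are illustrative ways of expressing the gap, not quantities defined in the paper.

```python
import math

# Uncertainties quoted in the text (dex) for Milky Way-mass halos.
sigma_10 = 0.17  # network using 10 neighbors
sigma_30 = 0.12  # network using 30 neighbors

measured_gain = sigma_10 / sigma_30   # close to sqrt(2)
poisson_gain = math.sqrt(30 / 10)     # sqrt(3) if sigma scales as N**-0.5

# If sigma scaled as N_eff**-0.5, the measured gain would correspond to this
# ratio of effective (independent) tracers, rather than the nominal factor of 3.
n_eff_ratio = measured_gain**2

# Equivalent power-law slope alpha in sigma ∝ N**(-alpha); Poisson gives 0.5.
alpha = math.log(measured_gain) / math.log(30 / 10)

print(f"measured gain = {measured_gain:.3f}, Poisson gain = {poisson_gain:.3f}")
print(f"effective tracer ratio = {n_eff_ratio:.2f}, slope alpha = {alpha:.2f}")
```

Read this way, tripling the number of satellites behaves roughly like doubling the number of independent tracers, which is consistent with the suggestion that satellite orbits are partly correlated.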
Beyond halo mass, we also plan to train new neural networks to estimate additional parameters such as the halo's spin axis, concentration, and assembly history. This would provide important context to our understanding of our own halo, including orbit modeling for satellites, as present halo models tend to assume a static mass and concentration history for the Milky Way.
The fiducial observational error model adds the following errors:
• 30 km s$^{-1}$ to satellite velocities,
• 5% relative error to the distance to the nearest larger halo,
• 5% relative error to the distance to the nearest $10^{14} M_\odot$ halo, and
• 10% error to the $v_\mathrm{max}$ of the most massive satellite.
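The error amplitudes listed above could be applied to simulated features roughly as follows. This is a hedged sketch, not the paper's implementation: the function name and argument layout are hypothetical, and it assumes that the errors are Gaussian and independent, that the 30 km/s term applies to each velocity component separately, and that the 10% error on the satellite's maximum circular velocity is a relative error. None of these choices is stated in the text shown here.

```python
import numpy as np

def add_fiducial_errors(vel_kms, d_larger, d_14, vmax_sat, rng):
    """Perturb one halo's features with the error amplitudes listed above.

    Assumptions (not stated in the text): errors are Gaussian and independent,
    the 30 km/s term applies per velocity component, and the 10% term on
    vmax_sat is a relative error.
    """
    vel_obs = vel_kms + rng.normal(0.0, 30.0, size=np.shape(vel_kms))
    d_larger_obs = d_larger * (1.0 + rng.normal(0.0, 0.05))
    d_14_obs = d_14 * (1.0 + rng.normal(0.0, 0.05))
    vmax_obs = vmax_sat * (1.0 + rng.normal(0.0, 0.10))
    return vel_obs, d_larger_obs, d_14_obs, vmax_obs

rng = np.random.default_rng(42)
# Placeholder inputs: zero relative velocities for 30 neighbors (km/s) and
# made-up distances and vmax, chosen only to exercise the function.
vel = np.zeros((30, 3))
vel_obs, d_larger_obs, d_14_obs, vmax_obs = add_fiducial_errors(
    vel, d_larger=0.8, d_14=11.0, vmax_sat=90.0, rng=rng)
```

Errors that grow with heliocentric distance, which the discussion identifies as necessary for an application to the Milky Way or M31, are not modeled in this sketch.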
Our expected performance is very similar when these approximate observational errors are included (the error increased from 0.118 dex to 0.135 dex at $10^{12} M_\odot$), which could be due to multiple possibilities. One is that the expected observational errors are small relative to the intrinsic scatter in satellite properties across different halos, and another is that correlations between satellite orbits are partially mitigating the effect of scatter (i.e., the same information is present in multiple satellites, and so is more robust to the presence of noise). The fact that the relative errors increased more for the 10-neighbor network (∼ 20%) compared to the 30-neighbor network (∼ 14%) suggests that this latter effect is present to some extent. However, the fact that the increase is relatively low for both networks suggests that the observational errors are small relative to halo-to-halo dispersion. This result gives us confidence that we can achieve a performance comparable to our model's ideal performance when applied to real data.
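The relative increases discussed above follow directly from the quoted RMSE values, as in the sketch below. Two caveats apply. The error-free 10-neighbor value is quoted only to two digits (0.17 dex), so the computed increase for that network will not exactly match the ∼ 20% given above. The "extra scatter" line assumes that observational and intrinsic errors add in quadrature, which is an assumption of this example and not a statement from the text.

```python
import math

# RMSE values (dex) at Milky Way mass, as quoted in the text and captions.
no_err = {"30 neighbors": 0.118, "10 neighbors": 0.17}
with_err = {"30 neighbors": 0.135, "10 neighbors": 0.199}

results = {}
for name in no_err:
    frac_increase = with_err[name] / no_err[name] - 1.0
    # Scatter that, added in quadrature to the error-free RMSE, would
    # reproduce the RMSE obtained with observational errors.
    added_in_quadrature = math.sqrt(with_err[name] ** 2 - no_err[name] ** 2)
    results[name] = (frac_increase, added_in_quadrature)
    print(f"{name}: +{100 * frac_increase:.0f}% RMSE, "
          f"extra scatter ~ {added_in_quadrature:.3f} dex")
```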
Additionally, the properties selected for our analysis (neighbors' specific angular momenta, radial distances, and relative velocities) show weak correlations with position or sky location relative to the Milky Way. This matters because real observations of satellites do not have consistent completeness across the full sky. We would expect that the distribution of the chosen properties would not depend on sky coverage, but a full test of this would require a careful combination of many different surveys' completeness maps for the Milky Way, which is beyond the scope of this paper.

This paper was built using the Open Journal of Astrophysics LaTeX template. The OJA is a journal which provides fast and easy peer review for new papers in the astro-ph section of the arXiv, making the reviewing process simpler for authors and referees alike. Learn more at http://astro.theoj.org.

Fig. 1.-The average specific angular momenta of the 30 largest satellites (selected by highest peak $v_\mathrm{max}$) versus central halo mass, for dark matter halos in the VSMDPL simulation. The expected dependence on halo mass ($j \propto M_h^{2/3}$) is shown by the red line, which the simulated halos generally follow tightly.
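The two-thirds power-law scaling quoted in the Fig. 1 caption follows from simple virial arguments: at fixed overdensity the halo radius scales as the cube root of the mass, and so does the circular velocity at that radius, so their product scales as mass to the two-thirds power. The sketch below verifies this numerically. The overdensity definition (200 times the critical density, with h = 1) and the helper names are assumptions made for illustration; the exponent does not depend on that choice at fixed redshift.

```python
import math

G = 4.30091e-6    # gravitational constant in kpc (km/s)^2 / Msun
RHO_CRIT = 277.5  # critical density for h = 1, in Msun / kpc^3 (approximate)
DELTA = 200.0     # assumed overdensity defining the halo boundary

def halo_radius_kpc(m_h):
    """Radius enclosing a mean density of DELTA * RHO_CRIT."""
    return (3.0 * m_h / (4.0 * math.pi * DELTA * RHO_CRIT)) ** (1.0 / 3.0)

def characteristic_j(m_h):
    """Characteristic specific angular momentum R * V_circ(R), in kpc km/s."""
    r = halo_radius_kpc(m_h)
    v = math.sqrt(G * m_h / r)
    return r * v

masses = [1e11, 1e12, 1e13, 1e14]  # halo masses in Msun
js = [characteristic_j(m) for m in masses]

# Logarithmic slope d log(j) / d log(M) across the full mass range.
slope = (math.log10(js[-1]) - math.log10(js[0])) / (
    math.log10(masses[-1]) - math.log10(masses[0]))
print(f"log-slope of j versus halo mass: {slope:.4f}")
```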

Fig. 2.-The neural network geometry we use to predict halo masses. Input features include neighboring halos' specific angular momenta ($j$), radial distances ($r$), and relative velocities ($v$), as well as the maximum circular velocity of the most massive satellite ($v_\mathrm{max,sat}$), the distance to the nearest larger halo ($d_\mathrm{larger}$), and the distance to the nearest halo with $M_\mathrm{vir} > 10^{14} M_\odot$ ($d_{14}$). For all networks (regardless of the number of inputs), there are 5 hidden layers gradually decreasing from 10 nodes to 2 nodes, with one output layer corresponding to the predicted halo mass.
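The geometry described in the Fig. 2 caption (five hidden layers narrowing from 10 nodes to 2, followed by a single output) can be sketched as a small fully connected network. The caption does not give the intermediate widths, the activation function, or the initialization, so the widths 10, 8, 6, 4, 2, the ReLU activation, the He-style random weights, and the input layout below are all assumptions made for illustration. The network here is untrained and its outputs carry no physical meaning.

```python
import numpy as np

def init_mlp(n_inputs, hidden=(10, 8, 6, 4, 2), seed=0):
    """Random weights for a fully connected network with one scalar output."""
    rng = np.random.default_rng(seed)
    sizes = (n_inputs, *hidden, 1)
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def predict_log_mass(params, features):
    """Forward pass: ReLU on hidden layers, linear output (predicted log mass)."""
    x = np.asarray(features, dtype=float)
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return (x @ w + b)[..., 0]

# Assumed layout: 30 neighbors x (specific angular momentum, radial distance,
# relative velocity) plus three halo-level features, giving 93 inputs.
n_inputs = 30 * 3 + 3
params = init_mlp(n_inputs)
batch = np.random.default_rng(1).normal(size=(5, n_inputs))  # placeholder data
out = predict_log_mass(params, batch)
```

Because the hidden widths are independent of the input size, the same `init_mlp` call works for the 10-neighbor case by changing only `n_inputs`, in line with the caption's statement that the hidden layers are the same for all networks.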

Fig. 4.-Left: the median specific angular momentum of neighboring halos as a function of halo mass, for halos that have at least 30 neighbors within 200 kpc. Halos are color-coded by the most massive satellite's maximum circular velocity, which correlates with host halo mass. Here, the neighbors of high-mass halos are much more likely to be satellites and thus have orbits with correlated specific angular momenta. In contrast, low-mass halos usually have non-satellite neighbors, which are less influenced by the low-mass halo's presence. So, the distributions of neighbors' specific angular momenta are much more correlated with halo mass for high-mass than low-mass halos. Right: the median specific angular momentum of halos' neighbors, now color-coded by the distance to the nearest massive halo ($M_h > 10^{14} M_\odot$). Gravitational forces from high-mass halos impact the orbits of all nearby halos, leading to higher relative velocities between low-mass halos and their neighbors. Moreover, massive halos have satellites that possess high velocities, and as these satellites' orbits can extend beyond the virial radii of the massive halos, they can pass near other lower-mass halos while having very large specific angular momentum offsets. Hence, the largest median specific angular momenta typically occur near massive halos.

Fig. 5.-Left: the median relative velocities of neighboring halos as a function of halo mass, for halos with at least 30 neighbors within 200 kpc of their centers. Halos are color-coded by the most massive satellite's maximum circular velocity, which correlates with host halo mass. Here, the neighbors of high-mass halos are much more likely to be satellites and thus have orbits with correlated relative velocities. In contrast, low-mass halos usually have non-satellite neighbors. As in Fig. 4, the distributions of neighbors' relative velocities are much more correlated with halo mass for high-mass than low-mass halos. Right: median relative velocities of halos' neighbors, now color-coded by the distance to the nearest massive halo ($M_h > 10^{14} M_\odot$). As in Fig. 4, the largest median neighbor relative velocities typically occur near massive halos.

Fig. 9.-This figure shows a neural network trained only on halos that are more than 10 Mpc/$h$ away from the nearest $10^{14} M_\odot$ halo (similar to the Milky Way and M31, which are ∼ 11 Mpc/$h$ away from the Virgo Cluster; Mei et al. 2007). The uncertainties are very modestly lower (0.113 dex instead of 0.118 dex) at a halo mass of $10^{12} M_\odot$.

Fig. 10.-Analogous to Figure 3, these figures show the predicted halo mass versus actual halo mass when adding reasonable observational errors to our model. The input features fed into the networks correspond to observables, primarily focusing on the specific angular momenta and other orbital properties of neighboring galaxies. The left figure displays outcomes from halos with a minimum of 30 neighboring galaxies, exhibiting reduced errors compared to the right figure, which considers halos with a minimum of 10 neighboring galaxies. The bottom panels illustrate the root mean square error (RMSE) versus the actual halo mass. The blue curve represents the addition of fiducial observational errors, while the simulation data without any observational errors is depicted by the orange curve. Notably, the introduction of fiducial observational errors does not substantially alter the results. For networks using 10 neighbors, typical errors amount to approximately 0.199 dex for Milky Way-mass halos, whereas for networks using 30 neighbors, these errors are about 0.135 dex. Error bars depict the standard deviations of predicted halo masses relative to actual halo masses, while the black line indicates the medians of predicted halo masses within bins of actual halo masses. The gray bar at $10^{12} M_\odot$ corresponds to the approximate mass of the Milky Way. Finally, the red line serves as a reference, indicating where the predicted mass would align with the actual mass.