Astrometric epoch propagation

Astrometric epoch propagation#

This notebook demonstrates how to improve crossmatches between LSST DP1 objects and Gaia DR3 by accounting for stellar motion over time (epoch propagation).

Epoch propagation updates a source’s sky position from one observation epoch to another using its proper motion (and, when available, parallax and radial velocity). This should not be confused with the Equinox of the coordinate system: both catalogs use the ICRS frame with Equinox J2000.

Run this notebook on the Rubin Science Platform (or another environment where DP1 is available and LSDB is pre-installed). For DP1 access details, see: https://lsdb.io/dp1

We will:

Set up the environment and define epochs for DP1 and Gaia DR3.
Open the relevant catalogs and compute a safe search radius.
Propagate Gaia source coordinates from the Gaia epoch to the DP1 epoch.
Crossmatch DP1 objects with the propagated Gaia positions and select the best matches.
Compare against a baseline crossmatch that does not use propagation.
Visualize the difference between the two approaches.

[1]:

import lsdb
import numpy as np
from astropy import units
from astropy.coordinates import Distance, ICRS, SkyCoord
from astropy.time import Time
from dask.distributed import Client
from upath import UPath

[2]:

print(f"{lsdb.__version__ = }")

lsdb.__version__ = '0.7.3'

1. Define epochs#

We specify the observation epochs for LSST DP1 (approximately late 2024) and Gaia DR3 (2016.0). These are used to translate Gaia source positions from the Gaia epoch to the DP1 epoch.

[3]:

# DP1 was mostly October 2024, we use an approximate value here for simplicity
dp1_epoch = 2024.9
# https://gaia.aip.de/metadata/gaiadr3/
gaia_dr3_epoch = 2016.0

2. Open catalogs and gather configuration#

Open the DP1 object catalog and retrieve its matching margin. This margin helps us decide how far to search around each DP1 source when looking for Gaia counterparts. Here we use Rubin Science Platform path to DP1, please adjust as needed; see https://lsdb.io/dp1 for details.

[4]:

base_path = UPath("/rubin/lsdb_data")
dp1_object = lsdb.open_catalog(
    base_path / "object_collection",
    columns=[],
)
dp1_margin_arcsec = dp1_object.margin.hc_structure.catalog_info.margin_threshold

3. Derive a proper-motion filter#

Compute a maximum allowed proper motion based on the DP1 margin and the time difference between the two epochs. This helps us restrict Gaia sources to those that could plausibly still fall within the DP1 search area after propagation.

[5]:

max_pm = 1000.0 * dp1_margin_arcsec / (dp1_epoch - gaia_dr3_epoch)
max_pm

[5]:

561.797752808983

4. Load a filtered Gaia DR3 subset#

Open a Gaia DR3 catalog with the columns needed for propagation and filter out entries lacking required motion or distance information. For S3 access we use MAST’s open buckets; see https://data.lsdb.io/#Gaia/Gaia_DR3_(US-East,_S3) for details.

[6]:

gaia_dr3 = lsdb.open_catalog(
    "s3://stpubdata/gaia/gaia_dr3/public/hats",
    columns=["pmra", "pmdec", "radial_velocity", "parallax"],
    filters=[("pm", "<", max_pm), ("parallax", ">", 0.0)],
).query("radial_velocity.notna()")

5. Define epoch propagation#

Create a helper that takes Gaia source attributes at the Gaia epoch and computes their positions at the DP1 epoch. We leverage Astropy to carry out the space‑motion propagation.

[7]:

def propagate_epoch(df, *, start, end):
    if len(df) == 0:
        return df.assign(ra_corr=np.array([], dtype=float), dec_corr=np.array([], dtype=float))

    start_coord = SkyCoord(
        ra=df["ra"].to_numpy() * units.deg,
        dec=df["dec"].to_numpy() * units.deg,
        pm_ra_cosdec=df["pmra"].to_numpy() * units.marcsec / units.year,
        pm_dec=df["pmdec"].to_numpy() * units.marcsec / units.year,
        radial_velocity=df["radial_velocity"].to_numpy() * units.km / units.s,
        distance=Distance(parallax=df["parallax"].to_numpy() * units.marcsec),
        obstime=Time(start, format="byear"),
        frame="icrs",
    )
    end_coord = start_coord.apply_space_motion(new_obstime=Time(end, format="byear"))
    df["ra_corr"] = end_coord.ra.deg
    df["dec_corr"] = end_coord.dec.deg
    return df


propagated_gaia = gaia_dr3.map_partitions(propagate_epoch, start=gaia_dr3_epoch, end=dp1_epoch)
propagated_gaia

[7]:

lsdb Catalog gaia:

	pmra	pmdec	radial_velocity	parallax	ra	dec	ra_corr	dec_corr
npartitions=2016
Order: 2, Pixel: 0	double[pyarrow]	double[pyarrow]	float[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	float64	float64
Order: 2, Pixel: 1	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...
Order: 3, Pixel: 766	...	...	...	...	...	...	...	...
Order: 3, Pixel: 767	...	...	...	...	...	...	...	...

8 out of 152 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

6. Crossmatch propagated Gaia to DP1 and select the best match#

This is a two‑step crossmatching pipeline:

We search with the largest radius that still preserves correct matches — the DP1 catalog’s margin defines this “safe” maximum. Using this margin ensures candidates that truly belong together are not missed.
We intentionally keep multiple nearby Gaia candidates per DP1 object at first. After that, we compare candidates and choose the single best one using their separation. This approach lets us propagate to the target epoch and then make an informed choice among several plausible neighbors.

[8]:

xmatch = dp1_object.crossmatch_nested(propagated_gaia, radius_arcsec=dp1_margin_arcsec, n_neighbors=20)
xmatch

[8]:

lsdb Catalog object_lc_x_gaia:

	coord_ra	coord_dec	gaia
npartitions=389
Order: 6, Pixel: 130	double[pyarrow]	double[pyarrow]	nested<pmra: [double], pmdec: [double], radial...
Order: 8, Pixel: 2176	...	...	...
...	...	...	...
Order: 9, Pixel: 2302101	...	...	...
Order: 7, Pixel: 143884	...	...	...

3 out of 3 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

For each DP1 object, we then choose the closest propagated Gaia candidate within a small tolerance. This reduces multiple candidates to a single best match when possible.

[9]:

def select_best_match(dp1_ra, dp1_dec, gaia_ra, gaia_dec, *, max_dist_arcsec):
    dp1_coord = SkyCoord(ra=dp1_ra * units.deg, dec=dp1_dec * units.deg)
    gaia_coord = SkyCoord(ra=gaia_ra * units.deg, dec=gaia_dec * units.deg)
    distance = gaia_coord.separation(dp1_coord)
    idx_close = np.where(distance < max_dist_arcsec * units.arcsec)[0]
    if len(idx_close) == 0:
        return {"gaia_ra_corr": np.nan, "gaia_dec_corr": np.nan, "distance": np.nan}
    best_idx = min(idx_close, key=distance.__getitem__)
    return {
        "gaia_ra_corr": gaia_ra[best_idx],
        "gaia_dec_corr": gaia_dec[best_idx],
        "distance": distance[best_idx].to_value(units.arcsec),
    }


best_matches = (
    xmatch.map_rows(
        select_best_match,
        columns=["coord_ra", "coord_dec", "gaia.ra_corr", "gaia.dec_corr"],
        max_dist_arcsec=0.1,
        append_columns=True,
        row_container="args",
        meta=dict.fromkeys(["gaia_ra_corr", "gaia_dec_corr", "distance"], float),
    )
    .map_partitions(lambda df: df.drop(columns=["gaia"]))
    .query("distance.notna()")
)
best_matches

[9]:

lsdb Catalog object_lc_x_gaia:

	coord_ra	coord_dec	gaia_ra_corr	gaia_dec_corr	distance
npartitions=389
Order: 6, Pixel: 130	double[pyarrow]	double[pyarrow]	float64	float64	float64
Order: 8, Pixel: 2176	...	...	...	...	...
...	...	...	...	...	...
Order: 9, Pixel: 2302101	...	...	...	...	...
Order: 7, Pixel: 143884	...	...	...	...	...

5 out of 5 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

7. Baseline: crossmatch without propagation#

As a point of comparison, perform a standard positional crossmatch using the Gaia positions at their original epoch (no propagation).

[10]:

naive_matches = dp1_object.crossmatch(gaia_dr3, radius_arcsec=0.1, suffix_method="overlapping_columns")

8. Compute results with a local Dask client#

All steps above are lazily evaluated. Here we actually run both pipelines — with propagation and without — and collect the results into Pandas DataFrames. We also print the number of matches from each pipeline.

[11]:

with Client(n_workers=4, threads_per_worker=1) as client:
    display(client)
    best_matches_df = best_matches.compute()
    print(f"Propagated matches: {len(best_matches_df)}")
    naive_matches_df = naive_matches.compute()
    print(f"Non-propagated matches: {len(naive_matches_df)}")

distributed.scheduler INFO: State start

distributed.scheduler INFO:   Scheduler at:     tcp://127.0.0.1:46535

distributed.scheduler INFO:   dashboard at:  https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/8787/status

distributed.scheduler INFO: Registering Worker plugin shuffle

distributed.nanny INFO:         Start Nanny at: 'tcp://127.0.0.1:37929'

distributed.nanny INFO:         Start Nanny at: 'tcp://127.0.0.1:32893'

distributed.nanny INFO:         Start Nanny at: 'tcp://127.0.0.1:39689'

distributed.nanny INFO:         Start Nanny at: 'tcp://127.0.0.1:37729'

distributed.scheduler INFO: Register worker addr: tcp://127.0.0.1:44929 name: 0

distributed.scheduler INFO: Starting worker compute stream, tcp://127.0.0.1:44929

distributed.core INFO: Starting established connection to tcp://127.0.0.1:52778

distributed.scheduler INFO: Register worker addr: tcp://127.0.0.1:46017 name: 1

distributed.scheduler INFO: Starting worker compute stream, tcp://127.0.0.1:46017

distributed.core INFO: Starting established connection to tcp://127.0.0.1:52782

distributed.scheduler INFO: Register worker addr: tcp://127.0.0.1:35485 name: 2

distributed.scheduler INFO: Starting worker compute stream, tcp://127.0.0.1:35485

distributed.core INFO: Starting established connection to tcp://127.0.0.1:52792

distributed.scheduler INFO: Register worker addr: tcp://127.0.0.1:44707 name: 3

distributed.scheduler INFO: Starting worker compute stream, tcp://127.0.0.1:44707

distributed.core INFO: Starting established connection to tcp://127.0.0.1:52806

distributed.scheduler INFO: Receive client connection: Client-08256a2e-da07-11f0-827e-af10543ff312

distributed.core INFO: Starting established connection to tcp://127.0.0.1:52808

Client

Client-08256a2e-da07-11f0-827e-af10543ff312

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/8787/status

Cluster Info

LocalCluster

15ba409d

Dashboard: https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/8787/status	Workers: 4
Total threads: 4	Total memory: 16.00 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-d9cd6483-becd-4cd3-b6df-1656560ab186

Comm: tcp://127.0.0.1:46535	Workers: 0
Dashboard: https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/8787/status	Total threads: 0
Started: Just now	Total memory: 0 B

Workers

Worker: 0

Comm: tcp://127.0.0.1:44929	Total threads: 1
Dashboard: https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/41041/status	Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:37929
Local directory: /tmp/dask-scratch-space/worker-5jwwnhta

Worker: 1

Comm: tcp://127.0.0.1:46017	Total threads: 1
Dashboard: https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/35583/status	Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:32893
Local directory: /tmp/dask-scratch-space/worker-9xafhnm1

Worker: 2

Comm: tcp://127.0.0.1:35485	Total threads: 1
Dashboard: https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/45399/status	Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:39689
Local directory: /tmp/dask-scratch-space/worker-ijjn4t4_

Worker: 3

Comm: tcp://127.0.0.1:44707	Total threads: 1
Dashboard: https://malanchev.nb.data.lsst.cloud/nb/user/malanchev/proxy/46427/status	Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:37729
Local directory: /tmp/dask-scratch-space/worker-nl9b83jc

Propagated matches: 1492
Non-propagated matches: 957

distributed.scheduler INFO: Remove client Client-08256a2e-da07-11f0-827e-af10543ff312

distributed.core INFO: Received 'close-stream' from tcp://127.0.0.1:52808; closing.

distributed.scheduler INFO: Remove client Client-08256a2e-da07-11f0-827e-af10543ff312

distributed.scheduler INFO: Close client connection: Client-08256a2e-da07-11f0-827e-af10543ff312

distributed.scheduler INFO: Retire worker addresses (stimulus_id='retire-workers-1765838977.415765') (0, 1, 2, 3)

distributed.nanny INFO: Closing Nanny at 'tcp://127.0.0.1:37929'. Reason: nanny-close

distributed.nanny INFO: Nanny asking worker to close. Reason: nanny-close

distributed.nanny INFO: Closing Nanny at 'tcp://127.0.0.1:32893'. Reason: nanny-close

distributed.nanny INFO: Nanny asking worker to close. Reason: nanny-close

distributed.nanny INFO: Closing Nanny at 'tcp://127.0.0.1:39689'. Reason: nanny-close

distributed.nanny INFO: Nanny asking worker to close. Reason: nanny-close

distributed.nanny INFO: Closing Nanny at 'tcp://127.0.0.1:37729'. Reason: nanny-close

distributed.nanny INFO: Nanny asking worker to close. Reason: nanny-close

distributed.core INFO: Received 'close-stream' from tcp://127.0.0.1:52782; closing.

distributed.core INFO: Received 'close-stream' from tcp://127.0.0.1:52792; closing.

distributed.scheduler INFO: Remove worker addr: tcp://127.0.0.1:46017 name: 1 (stimulus_id='handle-worker-cleanup-1765838977.4288983')

distributed.scheduler INFO: Remove worker addr: tcp://127.0.0.1:35485 name: 2 (stimulus_id='handle-worker-cleanup-1765838977.4302118')

distributed.core INFO: Received 'close-stream' from tcp://127.0.0.1:52806; closing.

distributed.scheduler INFO: Remove worker addr: tcp://127.0.0.1:44707 name: 3 (stimulus_id='handle-worker-cleanup-1765838977.4333227')

distributed.core INFO: Received 'close-stream' from tcp://127.0.0.1:52778; closing.

distributed.scheduler INFO: Remove worker addr: tcp://127.0.0.1:44929 name: 0 (stimulus_id='handle-worker-cleanup-1765838977.4580626')

distributed.scheduler INFO: Lost all workers

distributed.nanny INFO: Nanny at 'tcp://127.0.0.1:37729' closed.

distributed.nanny INFO: Nanny at 'tcp://127.0.0.1:37929' closed.

distributed.nanny INFO: Nanny at 'tcp://127.0.0.1:32893' closed.

distributed.nanny INFO: Nanny at 'tcp://127.0.0.1:39689' closed.

distributed.scheduler INFO: Closing scheduler. Reason: unknown

distributed.scheduler INFO: Scheduler closing all comms

9. Plot separation distribution#

Plot histograms of match separations for both approaches and highlight the effect of propagation. In this run, the propagated pipeline yields more matches (1492 vs 957), and the separation distribution is tighter (more matches at smaller separations).

[12]:

import matplotlib.pyplot as plt

plt.hist(
    naive_matches_df["_dist_arcsec"],
    bins=np.linspace(0, 0.1, 51),
    histtype="stepfilled",
    alpha=0.2,
    label=f"non-propagated, N={len(naive_matches_df)}",
)
plt.hist(
    best_matches_df["distance"],
    bins=np.linspace(0, 0.1, 51),
    histtype="stepfilled",
    alpha=0.3,
    label=f"propagated, N={len(best_matches_df)}",
)
plt.legend()

[12]:

<matplotlib.legend.Legend at 0x7f78a49b41a0>

../../_images/tutorials_pre_executed_dp1-gaia-epoch-prop_22_1.png

[ ]: