Intro to Rubin Catalog Operations

Intro to Rubin Catalog Operations#

In this tutorial, we will:

explore the basics of column discovery
work with an individual lightcurve
apply common filtering operations
calculate basic aggregation statistics

Introduction#

This tutorial showcases a handful of basic LSDB operations that should be useful when working with Rubin (DP1) data. These operations are likely to be used regardless of science case, and the particular examples in this tutorial should allow you to understand how to use these operations in other ways. For example, while we filter by photometric band in one of the example below, that filter can easily be modified to filter by a quality flag in the data.

Loading Data#

The details of loading Rubin data are discussed in How to Access Data, so we’ll just provide a starter codeblock below:

[1]:

from upath import UPath
import lsdb
from lsdb import ConeSearch

base_path = UPath("/rubin/lsdb_data")
object_collection = lsdb.open_catalog(base_path / "object_collection")

# Cone search on ECDFS (Extended Chandra Deep Field South)
object_collection = lsdb.open_catalog(
    base_path / "dia_object_collection",
    search_filter=ConeSearch(ra=52.838, dec=-28.279, radius_arcsec=5000),
)
object_collection

[1]:

lsdb Catalog dia_object_lc:

	dec	diaObjectId	nDiaSources	ra	radecMjdTai	tract	diaObjectForcedSource	diaSource
npartitions=42
Order: 10, Pixel: 9198847	double[pyarrow]	int64[pyarrow]	int64[pyarrow]	double[pyarrow]	double[pyarrow]	int64[pyarrow]	nested<band: [string], coord_dec: [double], co...	nested<band: [string], centroid_flag: [bool], ...
Order: 6, Pixel: 35933	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...
Order: 7, Pixel: 143825	...	...	...	...	...	...	...	...
Order: 6, Pixel: 35957	...	...	...	...	...	...	...	...

8 out of 140 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

As mentioned beneath the catalog dataframe, the view above is a “lazy” view of the data. Often, it’s nice to preview the first few rows to better understand the contents of the dataset:

[2]:

object_collection.head(5)

[2]:

dec

diaObjectId

nDiaSources

ra

radecMjdTai

tract

diaObjectForcedSource

diaSource

2528559836917790087

-28.694949

609787430777651220

1

53.446451

60632.084896

4849

band	coord_dec	...	psfMagErr	visit
z	-28.694949	...	0.022372	2024111700144
+0 rows	...	...	...	...

band	centroid_flag	...	y	yErr
z	False	...	3889.319548	0.400396
+0 rows	...	...	...	...

2528559862939733327

-28.688946

609787430777651219

1

53.439723

60632.084896

4849

band	coord_dec	...	psfMagErr	visit
z	-28.688946	...	0.012165	2024111700144
+0 rows	...	...	...	...

band	centroid_flag	...	y	yErr
z	False	...	3825.048257	0.086882
+0 rows	...	...	...	...

2528559881657338699

-28.688324

609787362058174465

1

53.459674

60632.084896

4849

band	coord_dec	...	psfMagErr	visit
z	-28.688325	...	0.005859	2024111700144
+0 rows	...	...	...	...

band	centroid_flag	...	y	yErr
z	False	...	3705.421866	NaN
+0 rows	...	...	...	...

2528560064243227454

-28.648182

609788117972418767

1

53.429821

60632.073989

4849

band	coord_dec	...	psfMagErr	visit
r	-28.648182	...	0.002465	2024111700129
+5 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	False	...	3793.330937	NaN
+0 rows	...	...	...	...

2528606911361164817

-28.630057

609788049252941917

1

53.505678

60632.084896

4849

band	coord_dec	...	psfMagErr	visit
z	-28.630057	...	NaN	2024111200302
+5 rows	...	...	...	...

band	centroid_flag	...	y	yErr
z	False	...	2472.350151	0.377441
+0 rows	...	...	...	...

5 rows x 8 columns

Viewing Available Columns#

The schema browser provides the most information regarding available (DP1) columns, there is also a handful of properties useful for quick column discovery within the LSDB API. First, all_columns gives a view of all available columns in the HATS catalog, even if only a handful of columns were selected on load:

[3]:

object_collection.all_columns

[3]:

['ra',
 'dec',
 'nDiaSources',
 'radecMjdTai',
 'g_psfFluxLinearSlope',
 'g_psfFluxLinearIntercept',
 'g_psfFluxMAD',
 'g_psfFluxMaxSlope',
 'g_psfFluxErrMean',
 'g_psfFluxMean',
 'g_psfFluxMeanErr',
 'g_psfFluxNdata',
 'g_scienceFluxMean',
 'g_scienceFluxMeanErr',
 'g_psfFluxMin',
 'g_psfFluxMax',
 'g_psfFluxPercentile05',
 'g_psfFluxPercentile25',
 'g_psfFluxPercentile50',
 'g_psfFluxPercentile75',
 'g_psfFluxPercentile95',
 'g_psfFluxSigma',
 'g_scienceFluxSigma',
 'g_psfFluxSkew',
 'g_psfFluxChi2',
 'g_psfFluxStetsonJ',
 'r_psfFluxLinearSlope',
 'r_psfFluxLinearIntercept',
 'r_psfFluxMAD',
 'r_psfFluxMaxSlope',
 'r_psfFluxErrMean',
 'r_psfFluxMean',
 'r_psfFluxMeanErr',
 'r_psfFluxNdata',
 'r_scienceFluxMean',
 'r_scienceFluxMeanErr',
 'r_psfFluxMin',
 'r_psfFluxMax',
 'r_psfFluxPercentile05',
 'r_psfFluxPercentile25',
 'r_psfFluxPercentile50',
 'r_psfFluxPercentile75',
 'r_psfFluxPercentile95',
 'r_psfFluxSigma',
 'r_scienceFluxSigma',
 'r_psfFluxSkew',
 'r_psfFluxChi2',
 'r_psfFluxStetsonJ',
 'u_psfFluxLinearSlope',
 'u_psfFluxLinearIntercept',
 'u_psfFluxMAD',
 'u_psfFluxMaxSlope',
 'u_psfFluxErrMean',
 'u_psfFluxMean',
 'u_psfFluxMeanErr',
 'u_psfFluxNdata',
 'u_scienceFluxMean',
 'u_scienceFluxMeanErr',
 'u_psfFluxMin',
 'u_psfFluxMax',
 'u_psfFluxPercentile05',
 'u_psfFluxPercentile25',
 'u_psfFluxPercentile50',
 'u_psfFluxPercentile75',
 'u_psfFluxPercentile95',
 'u_psfFluxSigma',
 'u_scienceFluxSigma',
 'u_psfFluxSkew',
 'u_psfFluxChi2',
 'u_psfFluxStetsonJ',
 'i_psfFluxLinearSlope',
 'i_psfFluxLinearIntercept',
 'i_psfFluxMAD',
 'i_psfFluxMaxSlope',
 'i_psfFluxErrMean',
 'i_psfFluxMean',
 'i_psfFluxMeanErr',
 'i_psfFluxNdata',
 'i_scienceFluxMean',
 'i_scienceFluxMeanErr',
 'i_psfFluxMin',
 'i_psfFluxMax',
 'i_psfFluxPercentile05',
 'i_psfFluxPercentile25',
 'i_psfFluxPercentile50',
 'i_psfFluxPercentile75',
 'i_psfFluxPercentile95',
 'i_psfFluxSigma',
 'i_scienceFluxSigma',
 'i_psfFluxSkew',
 'i_psfFluxChi2',
 'i_psfFluxStetsonJ',
 'z_psfFluxLinearSlope',
 'z_psfFluxLinearIntercept',
 'z_psfFluxMAD',
 'z_psfFluxMaxSlope',
 'z_psfFluxErrMean',
 'z_psfFluxMean',
 'z_psfFluxMeanErr',
 'z_psfFluxNdata',
 'z_scienceFluxMean',
 'z_scienceFluxMeanErr',
 'z_psfFluxMin',
 'z_psfFluxMax',
 'z_psfFluxPercentile05',
 'z_psfFluxPercentile25',
 'z_psfFluxPercentile50',
 'z_psfFluxPercentile75',
 'z_psfFluxPercentile95',
 'z_psfFluxSigma',
 'z_scienceFluxSigma',
 'z_psfFluxSkew',
 'z_psfFluxChi2',
 'z_psfFluxStetsonJ',
 'y_psfFluxLinearSlope',
 'y_psfFluxLinearIntercept',
 'y_psfFluxMAD',
 'y_psfFluxMaxSlope',
 'y_psfFluxErrMean',
 'y_psfFluxMean',
 'y_psfFluxMeanErr',
 'y_psfFluxNdata',
 'y_scienceFluxMean',
 'y_scienceFluxMeanErr',
 'y_psfFluxMin',
 'y_psfFluxMax',
 'y_psfFluxPercentile05',
 'y_psfFluxPercentile25',
 'y_psfFluxPercentile50',
 'y_psfFluxPercentile75',
 'y_psfFluxPercentile95',
 'y_psfFluxSigma',
 'y_scienceFluxSigma',
 'y_psfFluxSkew',
 'y_psfFluxChi2',
 'y_psfFluxStetsonJ',
 'diaObjectId',
 'tract',
 'diaSource',
 'diaObjectForcedSource']

Any nested columns (see Understanding the NestedFrame for an explanation on what these are) will have their own sets of sub-columns as well, we can first identify any nested columns programmatically using the nested_columns property:

[4]:

object_collection.nested_columns

[4]:

['diaObjectForcedSource', 'diaSource']

To view the available sub-columns, we use the nest accessor for one of the nested columns:

[5]:

object_collection["diaObjectForcedSource"].columns

[5]:

['band',
 'coord_dec',
 'coord_ra',
 'diff_PixelFlags_nodataCenter',
 'forcedSourceOnDiaObjectId',
 'invalidPsfFlag',
 'midpointMjdTai',
 'pixelFlags_bad',
 'pixelFlags_cr',
 'pixelFlags_crCenter',
 'pixelFlags_edge',
 'pixelFlags_interpolated',
 'pixelFlags_interpolatedCenter',
 'pixelFlags_nodata',
 'pixelFlags_saturated',
 'pixelFlags_saturatedCenter',
 'pixelFlags_suspect',
 'pixelFlags_suspectCenter',
 'psfDiffFlux',
 'psfDiffFlux_flag',
 'psfDiffFluxErr',
 'psfFlux',
 'psfFlux_flag',
 'psfFluxErr',
 'psfMag',
 'psfMagErr',
 'visit']

Viewing a Single Lightcurve#

Selecting a single lightcurve is most effectively done via the id_search function, in this case we have a particular “diaObjectId” in mind:

[6]:

objectid = 609782208097419314
single_id = object_collection.id_search(values={"diaObjectId": objectid}).compute()
single_id

Searching for diaObjectId=609782208097419314

[6]:

dec

diaObjectId

nDiaSources

ra

radecMjdTai

tract

diaObjectForcedSource

diaSource

2528690302717534293

-28.279918

609782208097419314

371

52.838269

60655.249976

4848

band	coord_dec	...	psfMagErr	visit
r	-28.279918	...	0.014186	2024110800247
+528 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	False	...	2424.926624	0.200209
+370 rows	...	...	...	...

1 rows x 8 columns

[7]:

from matplotlib.patches import Patch
import matplotlib.pyplot as plt
import numpy as np

first_lc = single_id.diaObjectForcedSource.iloc[0]

# Compute symmetric y-limits around 0 using 95% range
flux = first_lc["psfDiffFlux"].dropna()
limit = np.percentile(np.abs(flux), 97.5) + 100
y_min, y_max = -limit, limit

# Start plot
fig, ax = plt.subplots(2, 1, figsize=(10, 10), dpi=200)

# Define band → color mapping
band_colors = {"u": "blue", "g": "green", "r": "red", "i": "orange", "z": "purple", "y": "brown"}

# Plot each band with its color
for band, color in band_colors.items():
    band_data = first_lc[first_lc["band"] == band]
    if band_data.empty:
        continue
    ax[0].errorbar(
        band_data["midpointMjdTai"],
        band_data["psfDiffFlux"],
        yerr=band_data["psfDiffFluxErr"],
        fmt="o",
        color=color,
        ecolor=color,
        elinewidth=2,
        capsize=2,
        alpha=0.8,
        markeredgecolor="k",
        label=band,
    )

    ax[1].errorbar(
        band_data["midpointMjdTai"],
        band_data["psfMag"],
        yerr=band_data["psfMagErr"],
        fmt="o",
        color=color,
        ecolor=color,
        elinewidth=2,
        capsize=2,
        alpha=0.8,
        markeredgecolor="k",
        label=band,
    )

fig.suptitle(
    f'Object ID: {single_id["diaObjectId"].values[0]} RA: {single_id["ra"].values[0]:.5f}, Dec: {single_id["dec"].values[0]:.5f}'
)

ax[0].invert_yaxis()
ax[0].set_xlabel("MJD (midpointMjdTai)")
ax[0].set_ylabel("psfDiffFlux")
# ax[0].set_title(f'Object ID: {single_id["diaObjectId"].values[0]} RA: {single_id["ra"].values[0]:.5f}, Dec: {single_id["dec"].values[0]:.5f}', fontsize=12)
ax[0].set_ylim(y_min, y_max)
ax[0].set_xlim(60622, 60658)
ax[0].grid(True)
ax[0].legend(title="Band", loc="best")

ax[1].invert_yaxis()
ax[1].set_xlabel("MJD (midpointMjdTai)")
ax[1].set_ylabel("psfMag")
# ax[1].set_title(f'Object ID: {single_id["diaObjectId"].values[0]} RA: {single_id["ra"].values[0]:.5f}, Dec: {single_id["dec"].values[0]:.5f}', fontsize=12)
# ax[1].set_ylim(y_min, y_max)
ax[1].set_xlim(60622, 60658)
ax[1].grid(True)
ax[1].legend(title="Band", loc="best")
plt.tight_layout()

../../_images/tutorials_pre_executed_using_rubin_data_14_0.png

Common Filtering Operations#

Filtering by Number of Sources#

Provided the Source table(s) haven’t been modified by any filtering operations, the “nDiaSources” column is directly provided and allows for easy filtering based on lightcurve length. Note that “nDiaSources” is just a static column that maps to the unmodified lengths of DiaSources, once the number of DiaSources is modified then the value in this column will be out of date.

[8]:

oc_long_lcs = object_collection.query("nDiaSources > 10")
oc_long_lcs.head(5)

[8]:

dec

diaObjectId

nDiaSources

ra

radecMjdTai

tract

diaObjectForcedSource

diaSource

2528607275717728666

-28.583791

609788049252941847

12

53.489935

60651.220877

4849

band	coord_dec	...	psfMagErr	visit
r	-28.583791	...	0.001425	2024110800282
+18 rows	...	...	...	...

band	centroid_flag	...	y	yErr
g	True	...	2905.496545	0.128487
+11 rows	...	...	...	...

2528607345986705737

-28.559039

609788049252941850

12

53.512649

60651.220877

4849

band	coord_dec	...	psfMagErr	visit
r	-28.559039	...	0.002026	2024110800282
+21 rows	...	...	...	...

band	centroid_flag	...	y	yErr
g	False	...	3157.335056	NaN
+11 rows	...	...	...	...

2528608083060939470

-28.541942

609788049252941840

13

53.556345

60652.209345

4849

band	coord_dec	...	psfMagErr	visit
r	-28.541942	...	0.001169	2024110800282
+15 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	True	...	3786.383057	0.041322
+12 rows	...	...	...	...

2528608127600158337

-28.532665

609788049252941831

13

53.574546

60652.209345

4849

band	coord_dec	...	psfMagErr	visit
i	-28.532665	...	0.00159	2024110800261
+13 rows	...	...	...	...

band	centroid_flag	...	y	yErr
i	False	...	3538.524552	NaN
+12 rows	...	...	...	...

2528608365089811280

-28.505685

609788736447709250

11

53.586313

60651.101656

4849

band	coord_dec	...	psfMagErr	visit
i	-28.505685	...	0.000818	2024110800261
+21 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	False	...	3596.381758	NaN
+10 rows	...	...	...	...

5 rows x 8 columns

Filtering by Photometric Band#

Another common operation is to filter by band, which can done similarly to above, but using sub-column queries:

[9]:

oc_long_lcs_g = oc_long_lcs.query("diaObjectForcedSource.band == 'g'")
oc_long_lcs_g.head(5)

[9]:

dec

diaObjectId

nDiaSources

ra

radecMjdTai

tract

diaObjectForcedSource

diaSource

2528607275717728666

-28.583791

609788049252941847

12

53.489935

60651.220877

4849

band	coord_dec	...	psfMagErr	visit
g	-28.583791	...	0.001723	2024110800283
+3 rows	...	...	...	...

band	centroid_flag	...	y	yErr
g	True	...	2905.496545	0.128487
+11 rows	...	...	...	...

2528607345986705737

-28.559039

609788049252941850

12

53.512649

60651.220877

4849

band	coord_dec	...	psfMagErr	visit
g	-28.559039	...	0.003749	2024110800283
+1 rows	...	...	...	...

band	centroid_flag	...	y	yErr
g	False	...	3157.335056	NaN
+11 rows	...	...	...	...

2528608083060939470

-28.541942

609788049252941840

13

53.556345

60652.209345

4849

band	coord_dec	...	psfMagErr	visit
g	-28.541942	...	0.001538	2024110800283
+0 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	True	...	3786.383057	0.041322
+12 rows	...	...	...	...

2528608127600158337

-28.532665

609788049252941831

13

53.574546

60652.209345

4849

None

band	centroid_flag	...	y	yErr
i	False	...	3538.524552	NaN
+12 rows	...	...	...	...

2528608365089811280

-28.505685

609788736447709250

11

53.586313

60651.101656

4849

band	coord_dec	...	psfMagErr	visit
g	-28.505685	...	0.001689	2024112800113
+3 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	False	...	3596.381758	NaN
+10 rows	...	...	...	...

5 rows x 8 columns

Note: Filtering operations on “diaObjectForcedSource” are not propagated to “diaSource”. Any filtering operations on “diaSource” should be applied in addition to any operations done on “diaObjectForcedSource”.

Filtering Empty Lightcurves#

Sometimes, filters on lightcurves may throw out all observations for certain objects, leading to empty lightcurves as seen for one of the objects above. In this case, we can filter objects with empty lightcurves using dropna:

[10]:

oc_long_lcs_g = oc_long_lcs_g.map_partitions(lambda df: df.dropna(subset=["diaObjectForcedSource"]))
oc_long_lcs_g.head(5)

[10]:

dec

diaObjectId

nDiaSources

ra

radecMjdTai

tract

diaObjectForcedSource

diaSource

2528607275717728666

-28.583791

609788049252941847

12

53.489935

60651.220877

4849

band	coord_dec	...	psfMagErr	visit
g	-28.583791	...	0.001723	2024110800283
+3 rows	...	...	...	...

band	centroid_flag	...	y	yErr
g	True	...	2905.496545	0.128487
+11 rows	...	...	...	...

2528607345986705737

-28.559039

609788049252941850

12

53.512649

60651.220877

4849

band	coord_dec	...	psfMagErr	visit
g	-28.559039	...	0.003749	2024110800283
+1 rows	...	...	...	...

band	centroid_flag	...	y	yErr
g	False	...	3157.335056	NaN
+11 rows	...	...	...	...

2528608083060939470

-28.541942

609788049252941840

13

53.556345

60652.209345

4849

band	coord_dec	...	psfMagErr	visit
g	-28.541942	...	0.001538	2024110800283
+0 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	True	...	3786.383057	0.041322
+12 rows	...	...	...	...

2528608365089811280

-28.505685

609788736447709250

11

53.586313

60651.101656

4849

band	coord_dec	...	psfMagErr	visit
g	-28.505685	...	0.001689	2024112800113
+3 rows	...	...	...	...

band	centroid_flag	...	y	yErr
r	False	...	3596.381758	NaN
+10 rows	...	...	...	...

2528608386523759343

-28.494832

609788736447709234

19

53.584151

60651.101656

4849

band	coord_dec	...	psfMagErr	visit
g	-28.494832	...	0.004753	2024112800113
+4 rows	...	...	...	...

band	centroid_flag	...	y	yErr
i	False	...	3536.879991	NaN
+18 rows	...	...	...	...

5 rows x 8 columns

Calculating Basic Statistics#

While Rubin DP1 data has many statistics pre-computed in object table column, custom computation of statistics remains broadly useful.

Simple aggregrations can be applied via the map_rows function, where below we define a very simple mean magnitude function and pass it along to each row, selecting the “psfMag” sub-column of “diaObjectForcedSource” to compute the mean of for each object.

[11]:

import numpy as np


def mean_mag(row):
    return {"mean_psfMag": np.mean(row["diaObjectForcedSource.psfMag"])}


# meta defines the expected structure of the result
# append_columns adds the result as a column to the original catalog
oc_mean_mags_g = oc_long_lcs_g.map_rows(
    mean_mag, columns=["diaObjectForcedSource.psfMag"], meta={"mean_psfMag": np.float64}, append_columns=True
)
oc_mean_mags_g.head(10)[["mean_psfMag"]]

[11]:

	mean_psfMag
_healpix_29
2528607275717728666	19.411814
2528607345986705737	19.042889
2528608083060939470	17.257166
2528608365089811280	17.539051
2528608386523759343	19.579895
2528611470051636305	19.204950
2528611708541527859	19.087742
2528611724254005370	18.638960
2528652384125757089	17.412634
2528653248294272515	18.371521

10 rows × 1 columns

In this example, we needed to define the dask “meta”, see here for a more dedicated discussion on meta.

About#

Author(s): Doug Branton

Last updated on: 20 Jan 2026

If you use lsdb for published research, please cite following instructions.

Intro to Rubin Catalog Operations

Contents

Intro to Rubin Catalog Operations#

Introduction#

Loading Data#

Viewing Available Columns#

Viewing a Single Lightcurve#

Common Filtering Operations#

Filtering by Number of Sources#

Filtering by Photometric Band#

Filtering Empty Lightcurves#

Calculating Basic Statistics#

About#