Row Filtering#

In this tutorial, we will:

  • set up a Dask client and open an object catalog

  • filter rows of data using expressions involving column values

  • preview catalog data and query results

Introduction#

When a catalog is opened, it is available for operations. However, its data is lazily loaded, and operations on it remain unrealized until computation is requested explicitly (using the .compute() method) or implicitly, via data preview functions.
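This lazy-evaluation pattern can be illustrated with plain Dask, independent of LSDB (a minimal sketch; the `double` function is made up for illustration):

```python
import dask


@dask.delayed
def double(x):
    # Calling a delayed function only builds a task-graph node; nothing runs yet.
    return 2 * x


lazy = double(21)        # an unrealized Delayed object
result = lazy.compute()  # computation happens only here
```

LSDB catalogs behave analogously: filters and column operations build up a graph that is only executed on `.compute()` or a preview call.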

[1]:
import lsdb
from dask.distributed import Client

1. Open a catalog#

We create a basic Dask client, and open an existing HATS catalog: the ZTF DR14 catalog.

Additional Help

For additional information on Dask client creation, please refer to the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Note that Dask also provides its own best practices, which may be useful to consult.

For tips on accessing remote data, see our Accessing remote data guide.

Create a basic Dask client, limiting the number of workers. This keeps subsequent operations from using more of our compute resources than we might intend, which is helpful in any case but especially when working on a shared resource.

[2]:
client = Client(n_workers=4, memory_limit="auto")
client
[2]:

Client: Client-523280ce-232f-11f1-96f6-8e6cf3dc31f8
Connection method: Cluster object; Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

[3]:
ztf_object_path = "https://data.lsdb.io/hats/ztf_dr14/ztf_object"
ztf_object = lsdb.open_catalog(ztf_object_path)
ztf_object
[3]:
lsdb Catalog ztf_dr14:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
npartitions=2352
Order: 3, Pixel: 0 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] int32[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 3, Pixel: 1 ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3070 ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3071 ... ... ... ... ... ... ... ... ... ... ... ...
12 out of 15 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

1.1. Previewing part of the data#

Computing an entire catalog requires loading all of its resulting data into memory, which is expensive and may lead to out-of-memory issues.

Often, our goal is to peek at a slice of the data to make sure the workflow output is reasonable (e.g., to check that newly created columns are present and their values have been properly processed). head() is a Pandas-like method which allows us to preview part of the data for this purpose. It iterates over the existing catalog partitions in sequence, returning up to n rows from the first partition(s) able to supply them. Related methods include .tail() and .sample().

There is also .random_sample(), but that method fetches rows from many partitions (rather than from the first qualifying ones), so it can be much more expensive, even if it may be more representative.

Notice that all these previewing methods implicitly call compute(), and will implicitly use the Client we created earlier.
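The semantics of these preview methods mirror their pandas counterparts, sketched here on a plain DataFrame (the catalog versions behave the same way, but pull rows from partitions lazily; the `mag` column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"mag": range(100)})

first5 = df.head()                      # first n rows (n defaults to 5)
last3 = df.tail(3)                      # last 3 rows
sampled = df.sample(4, random_state=0)  # 4 rows drawn at random
```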

[4]:
ztf_object.head()
[4]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
3122680427 108000449961107213 44.996152 0.005626 18.1133 17.6278 17.4065 364 381 18 18.143208 17.569102 17.373423
24532305457 108020450140974134 45.01411 0.019775 21.565901 21.134199 21.414301 42 114 1 21.298568 21.007693 20.599543
26093200103 108010450069569645 45.007042 0.015954 22.047899 21.641199 21.655001 5 17 0 21.552179 21.405765 <NA>
29154369144 108020450049144323 45.004993 0.01991 14.8925 14.1057 13.7921 371 387 18 14.916993 14.028394 13.727368
29640602698 108020450042525676 45.004327 0.021077 19.278601 18.040701 16.853201 126 302 15 19.34495 17.897595 16.653296

5 rows × 12 columns

2. Selecting data rows by querying column values#

We can filter by column values via query().

The expression in the string given to .query() follows the same syntax accepted by Pandas’ .query(), which supports a subset of Python expressions for filtering DataFrames.

Column names that are not valid Python variable names should be wrapped in backticks, and variable values can be injected using f-strings. The use of ‘@’ to reference variables is not supported.

More information about Pandas query strings is available here.
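A small pandas sketch of these rules (the column names and the `limit` variable are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "mean_mag_i": [15.2, 17.8, 14.1],
    "nobs i": [60, 10, 3],  # hypothetical name that is not a valid identifier
})

limit = 16
# Backticks wrap the invalid column name; the f-string injects limit's value.
bright = df.query(f"mean_mag_i < {limit} and `nobs i` > 5")
```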

In the following query, we want to find objects in the catalog whose magnitude is brighter than 16:

[5]:
bright = ztf_object.query("mean_mag_i < 16")
bright
[5]:
lsdb Catalog ztf_dr14:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
npartitions=2352
Order: 3, Pixel: 0 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] int32[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 3, Pixel: 1 ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3070 ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3071 ... ... ... ... ... ... ... ... ... ... ... ...
12 out of 15 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

We’ll use .head() for a quick sanity check to be sure that no mean_mag_i is dimmer than 16. Since it’s only a few rows, it’s not a guarantee, but it does help us to be sure that we didn’t make any obvious mistake with our query expression.

[6]:
%%time
bright.head(10)
CPU times: user 448 ms, sys: 15.9 ms, total: 464 ms
Wall time: 1.33 s
[6]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
29154369144 108020450049144323 45.004993 0.01991 14.8925 14.1057 13.7921 371 387 18 14.916993 14.028394 13.727368
282950614202 108050450482618413 45.048268 0.048254 15.6769 14.9692 14.6613 381 391 18 15.716422 14.905342 14.602479
643735899268 108090449932592079 44.993251 0.076349 16.728901 16.2297 16.0023 381 393 18 16.760177 16.154505 15.969701
1429973025325 108160451321075838 45.132135 0.137899 12.935 12.51 12.338 381 0 18 12.7391 <NA> 12.192645
1476436129689 108180451646805881 45.164636 0.154514 15.7359 15.2979 15.1167 381 393 18 15.755066 15.243087 15.091201
1531897147170 108170451166024104 45.116649 0.144748 15.6025 15.1791 14.9971 381 393 18 15.615784 15.13288 14.971479
1668455889710 108150450581393374 45.058158 0.127411 13.2085 12.777 12.397 381 340 18 13.234258 12.412657 12.14209
1687311633264 108150450336704808 45.033569 0.128532 15.2136 14.5623 14.2601 381 393 18 15.232634 14.489734 14.195812
1755872301127 108170450559215794 45.055946 0.146135 15.6313 15.1561 14.9387 381 393 18 15.665383 15.094206 14.909868
2123226198499 108230450634635231 45.063462 0.195674 16.121099 15.688 15.5198 359 374 18 16.139978 15.637057 15.509201

10 rows × 12 columns

You can use parentheses, logical operators, and more than one column name in your expressions. Here, we alter the query to include not only those objects with a mean_mag_i brighter than 16, but which also have at least 50 observations in that band. Note that this query takes longer than the original, mostly because more partitions must be scanned before enough rows satisfy the stricter condition.

[7]:
%%time
bright_hi_obs = ztf_object.query("mean_mag_i < 16 and nobs_i > 50")
bright_hi_obs.head(10)
CPU times: user 1.1 s, sys: 40.6 ms, total: 1.14 s
Wall time: 3.61 s
[7]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
9746096715954945 116280369296843043 36.929692 6.902155 15.3244 14.6802 14.4398 774 1001 60 15.355645 14.614888 14.384647
9746152217497820 116290369298993007 36.929925 6.910468 14.5683 14.1753 14.0213 774 1001 59 14.590252 14.126385 13.999373
9746182134477075 116300369318264577 36.931842 6.920077 16.212 15.7679 15.5924 774 1001 59 16.233738 15.726957 15.561661
9746646972146367 116320368771921345 36.877159 6.934018 15.2244 14.5829 14.3673 776 1006 60 15.237038 14.507473 14.276531
9746707367345686 116330369022928298 36.902303 6.9482 14.7373 14.2589 14.1125 776 1006 60 14.761919 14.222437 14.067747
9746876028689311 116370368601585361 36.860077 6.979021 15.9226 15.3299 15.07 776 1006 59 15.955589 15.268368 15.014759
9746935734663253 116360369123478735 36.91243 6.973317 -999.0 -999.0 14.5268 776 1006 60 13.256074 12.836332 12.721764
9746935734663253 116360369124398415 36.91243 6.973317 13.371 12.982 12.847 776 1006 60 13.256074 12.836332 12.721764
9747970250983889 116450369416576857 36.941667 7.046998 15.4739 14.9089 14.6781 422 476 59 15.497479 14.844907 14.622186
9748271495137218 116350368300792029 36.830091 6.959612 16.531 15.7944 15.5202 776 1005 60 16.559328 15.726777 15.455964

10 rows × 12 columns

3. Filtering using Python expressions#

In some cases it may be more readable to query using Python expressions, Pandas-style. In this form, the catalog is indexed using an expression, selecting the rows for which the expression is true. The form of this query is filtered = collection[expr_with_collection], where expr_with_collection needs to evaluate to something which is:

  • of the same size as collection; and

  • convertible to boolean

The expression below produces the same result as the earlier .query() example; whether it is more readable than .query() depends on your expression and what it includes. But there are a couple of differences to account for, things which .query() handles for you:

  • The use of & instead of and (also, | instead of or). The Python logical keywords don’t work here.

  • Having to use ( and ) to ensure the intended precedence of the operators. (& and | are bitwise operators and bind more tightly than comparison operators such as <, so without parentheses the comparisons would be grouped incorrectly.)
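The precedence pitfall can be seen on a plain DataFrame (a toy sketch; the columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"mag": [15.0, 17.0], "nobs": [60, 10]})

# Parenthesized: each comparison yields a boolean Series, combined elementwise.
good = df[(df["mag"] < 16) & (df["nobs"] > 50)]

# Unparenthesized, `16 & df["nobs"]` is evaluated first, and the resulting
# chained comparison tries to take the truth value of a Series -> ValueError.
try:
    df[df["mag"] < 16 & df["nobs"] > 50]
    misparse_raised = False
except ValueError:
    misparse_raised = True
```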

Note that the time taken is essentially identical to that of the .query() method. There is no particular performance advantage to either approach, as the underlying computations are vectorized in the same way.

[8]:
%%time
bright_ex = ztf_object[(ztf_object["mean_mag_i"] < 16) & (ztf_object["nobs_i"] > 50)]
bright_ex.head()
CPU times: user 1.32 s, sys: 45.7 ms, total: 1.37 s
Wall time: 3.8 s
[8]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
9746096715954945 116280369296843043 36.929692 6.902155 15.3244 14.6802 14.4398 774 1001 60 15.355645 14.614888 14.384647
9746152217497820 116290369298993007 36.929925 6.910468 14.5683 14.1753 14.0213 774 1001 59 14.590252 14.126385 13.999373
9746182134477075 116300369318264577 36.931842 6.920077 16.212 15.7679 15.5924 774 1001 59 16.233738 15.726957 15.561661
9746646972146367 116320368771921345 36.877159 6.934018 15.2244 14.5829 14.3673 776 1006 60 15.237038 14.507473 14.276531
9746707367345686 116330369022928298 36.902303 6.9482 14.7373 14.2589 14.1125 776 1006 60 14.761919 14.222437 14.067747

5 rows × 12 columns

Closing the Dask client#

[9]:
client.close()

About#

Authors: Sandro Campos, Melissa DeLucchi, Olivia Lynn, and Derek Jones

Last updated on: April 14, 2025

If you use lsdb for published research, please cite it following these instructions.