Row Filtering#
In this tutorial, we will:

- set up a Dask client and open an object catalog
- filter rows of data using expressions involving column values
- preview catalog data and query results
Introduction#
When a catalog is opened, it is available for operations. However, its data is lazily loaded, and operations on it remain unrealized until computation is requested explicitly (using the .compute() method) or implicitly, with data preview functions.
[1]:
import lsdb
from dask.distributed import Client
1. Open a catalog#
We create a basic Dask client and open an existing HATS catalog—the ZTF DR14 object catalog.
Additional Help
For additional information on Dask client creation, please refer to the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Note that Dask also provides its own best practices, which may be useful to consult.
For tips on accessing remote data, see our Accessing remote data guide.
Create a basic Dask client, limiting the number of workers. This keeps subsequent operations from using more compute resources than we intend, which is helpful in any case but especially when working on a shared resource.
[2]:
client = Client(n_workers=4, memory_limit="auto")
client
[2]:
Client
Client-523280ce-232f-11f1-96f6-8e6cf3dc31f8
| Connection method: Cluster object | Cluster type: distributed.LocalCluster |
| Dashboard: http://127.0.0.1:8787/status |
Cluster Info
LocalCluster
02be60b2
| Dashboard: http://127.0.0.1:8787/status | Workers: 4 |
| Total threads: 4 | Total memory: 13.09 GiB |
| Status: running | Using processes: True |
Scheduler Info
Scheduler
Scheduler-6ed0635d-9e6f-4e5a-834a-cd160b6f03f6
| Comm: tcp://127.0.0.1:34373 | Workers: 0 |
| Dashboard: http://127.0.0.1:8787/status | Total threads: 0 |
| Started: Just now | Total memory: 0 B |
Workers
Worker: 0
| Comm: tcp://127.0.0.1:45781 | Total threads: 1 |
| Dashboard: http://127.0.0.1:40925/status | Memory: 3.27 GiB |
| Nanny: tcp://127.0.0.1:46561 | |
| Local directory: /tmp/dask-scratch-space/worker-w7xtvhfg | |
Worker: 1
| Comm: tcp://127.0.0.1:44465 | Total threads: 1 |
| Dashboard: http://127.0.0.1:41519/status | Memory: 3.27 GiB |
| Nanny: tcp://127.0.0.1:38173 | |
| Local directory: /tmp/dask-scratch-space/worker-u89msbyj | |
Worker: 2
| Comm: tcp://127.0.0.1:40073 | Total threads: 1 |
| Dashboard: http://127.0.0.1:37713/status | Memory: 3.27 GiB |
| Nanny: tcp://127.0.0.1:46605 | |
| Local directory: /tmp/dask-scratch-space/worker-ouylqdbf | |
Worker: 3
| Comm: tcp://127.0.0.1:44369 | Total threads: 1 |
| Dashboard: http://127.0.0.1:46493/status | Memory: 3.27 GiB |
| Nanny: tcp://127.0.0.1:33759 | |
| Local directory: /tmp/dask-scratch-space/worker-i9xiu5ol | |
[3]:
ztf_object_path = "https://data.lsdb.io/hats/ztf_dr14/ztf_object"
ztf_object = lsdb.open_catalog(ztf_object_path)
ztf_object
[3]:
| ps1_objid | ra | dec | ps1_gMeanPSFMag | ps1_rMeanPSFMag | ps1_iMeanPSFMag | nobs_g | nobs_r | nobs_i | mean_mag_g | mean_mag_r | mean_mag_i | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=2352 | ||||||||||||
| Order: 3, Pixel: 0 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | int32[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 3, Pixel: 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 4, Pixel: 3070 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 4, Pixel: 3071 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1.1. Previewing part of the data#
Computing an entire catalog requires loading all of its resulting data into memory, which is expensive and may lead to out-of-memory issues.
Often, our goal is to peek at a slice of the data to make sure the workflow output is reasonable (e.g., to check that newly created columns are present and their values have been properly processed). head() is a Pandas-like method that allows us to preview part of the data for this purpose. It iterates over the existing catalog partitions, in sequence, returning up to n rows from the first partition(s) able to supply them. Related methods include .tail() and .sample().
There is also .random_sample(), but that method fetches rows from many partitions (rather than from the first qualifying ones), so it can be much more expensive, even if it may be more representative.
Notice that all these preview methods call compute() implicitly, and will use the Client we created earlier.
[4]:
ztf_object.head()
[4]:
| ps1_objid | ra | dec | ps1_gMeanPSFMag | ps1_rMeanPSFMag | ps1_iMeanPSFMag | nobs_g | nobs_r | nobs_i | mean_mag_g | mean_mag_r | mean_mag_i | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | ||||||||||||
| 3122680427 | 108000449961107213 | 44.996152 | 0.005626 | 18.1133 | 17.6278 | 17.4065 | 364 | 381 | 18 | 18.143208 | 17.569102 | 17.373423 |
| 24532305457 | 108020450140974134 | 45.01411 | 0.019775 | 21.565901 | 21.134199 | 21.414301 | 42 | 114 | 1 | 21.298568 | 21.007693 | 20.599543 |
| 26093200103 | 108010450069569645 | 45.007042 | 0.015954 | 22.047899 | 21.641199 | 21.655001 | 5 | 17 | 0 | 21.552179 | 21.405765 | <NA> |
| 29154369144 | 108020450049144323 | 45.004993 | 0.01991 | 14.8925 | 14.1057 | 13.7921 | 371 | 387 | 18 | 14.916993 | 14.028394 | 13.727368 |
| 29640602698 | 108020450042525676 | 45.004327 | 0.021077 | 19.278601 | 18.040701 | 16.853201 | 126 | 302 | 15 | 19.34495 | 17.897595 | 16.653296 |
5 rows × 12 columns
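These preview methods mirror their pandas counterparts. A minimal sketch with a plain pandas DataFrame standing in for a catalog (toy columns and values invented for illustration; in LSDB the same calls trigger a partial compute):

```python
import pandas as pd

# Toy stand-in for a catalog; column names echo the tutorial's schema.
df = pd.DataFrame(
    {"mean_mag_i": [17.3, 15.9, 21.4, 14.1], "nobs_i": [18, 60, 1, 59]}
)

print(df.head(2))    # first 2 rows
print(df.tail(2))    # last 2 rows
print(df.sample(2))  # 2 rows chosen at random
```

In LSDB, head() and tail() scan partitions in order, so they are cheap; sampling-style previews may touch more partitions.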
2. Selecting data rows by querying column values#
We can filter by column values via query().
The expression in the string given to .query() follows the same syntax accepted by Pandas’ .query(), which supports a subset of Python expressions for filtering DataFrames.
Column names that are not valid Python variable names should be wrapped in backticks, and variable values can be injected using f-strings. The use of '@' to reference variables is not supported.
More information about Pandas query strings is available here.
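As a small illustration of these rules with plain pandas (whose query syntax LSDB follows), using a hypothetical column whose name contains a space, and an f-string to inject a Python variable:

```python
import pandas as pd

# Hypothetical column name that is not a valid Python identifier.
df = pd.DataFrame({"mean mag i": [15.2, 17.8, 14.9]})

limit = 16  # injected via an f-string, since '@' references are not supported
bright = df.query(f"`mean mag i` < {limit}")  # backticks wrap the odd name
print(bright)
```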
In the following query, we want to find objects in the catalog whose mean i-band magnitude is brighter than 16:
[5]:
bright = ztf_object.query("mean_mag_i < 16")
bright
[5]:
| ps1_objid | ra | dec | ps1_gMeanPSFMag | ps1_rMeanPSFMag | ps1_iMeanPSFMag | nobs_g | nobs_r | nobs_i | mean_mag_g | mean_mag_r | mean_mag_i | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=2352 | ||||||||||||
| Order: 3, Pixel: 0 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | int32[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 3, Pixel: 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 4, Pixel: 3070 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 4, Pixel: 3071 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
We’ll use .head() for a quick sanity check that no mean_mag_i is dimmer than 16. Since it’s only a few rows, it’s not a guarantee, but it does help confirm that we didn’t make any obvious mistake in our query expression.
[6]:
%%time
bright.head(10)
CPU times: user 448 ms, sys: 15.9 ms, total: 464 ms
Wall time: 1.33 s
[6]:
| ps1_objid | ra | dec | ps1_gMeanPSFMag | ps1_rMeanPSFMag | ps1_iMeanPSFMag | nobs_g | nobs_r | nobs_i | mean_mag_g | mean_mag_r | mean_mag_i | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | ||||||||||||
| 29154369144 | 108020450049144323 | 45.004993 | 0.01991 | 14.8925 | 14.1057 | 13.7921 | 371 | 387 | 18 | 14.916993 | 14.028394 | 13.727368 |
| 282950614202 | 108050450482618413 | 45.048268 | 0.048254 | 15.6769 | 14.9692 | 14.6613 | 381 | 391 | 18 | 15.716422 | 14.905342 | 14.602479 |
| 643735899268 | 108090449932592079 | 44.993251 | 0.076349 | 16.728901 | 16.2297 | 16.0023 | 381 | 393 | 18 | 16.760177 | 16.154505 | 15.969701 |
| 1429973025325 | 108160451321075838 | 45.132135 | 0.137899 | 12.935 | 12.51 | 12.338 | 381 | 0 | 18 | 12.7391 | <NA> | 12.192645 |
| 1476436129689 | 108180451646805881 | 45.164636 | 0.154514 | 15.7359 | 15.2979 | 15.1167 | 381 | 393 | 18 | 15.755066 | 15.243087 | 15.091201 |
| 1531897147170 | 108170451166024104 | 45.116649 | 0.144748 | 15.6025 | 15.1791 | 14.9971 | 381 | 393 | 18 | 15.615784 | 15.13288 | 14.971479 |
| 1668455889710 | 108150450581393374 | 45.058158 | 0.127411 | 13.2085 | 12.777 | 12.397 | 381 | 340 | 18 | 13.234258 | 12.412657 | 12.14209 |
| 1687311633264 | 108150450336704808 | 45.033569 | 0.128532 | 15.2136 | 14.5623 | 14.2601 | 381 | 393 | 18 | 15.232634 | 14.489734 | 14.195812 |
| 1755872301127 | 108170450559215794 | 45.055946 | 0.146135 | 15.6313 | 15.1561 | 14.9387 | 381 | 393 | 18 | 15.665383 | 15.094206 | 14.909868 |
| 2123226198499 | 108230450634635231 | 45.063462 | 0.195674 | 16.121099 | 15.688 | 15.5198 | 359 | 374 | 18 | 16.139978 | 15.637057 | 15.509201 |
10 rows × 12 columns
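The same sanity check can be made explicit: the preview is a pandas DataFrame, so we can assert the predicate holds on every previewed row. A sketch with a toy DataFrame standing in for the computed result of bright.head(10):

```python
import pandas as pd

# Toy stand-in for the pandas DataFrame returned by bright.head(10).
preview = pd.DataFrame({"mean_mag_i": [13.7, 14.6, 15.9]})

# Every previewed row should satisfy the query predicate.
assert (preview["mean_mag_i"] < 16).all(), "a dim object leaked into the result"
print("preview rows all satisfy mean_mag_i < 16")
```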
You can use parentheses, logical operators, and more than one column name in your expressions. Here, we alter the query to include only those objects with a mean_mag_i brighter than 16 that also have more than 50 observations in that band. Note that this query takes longer than the original, mostly because more rows must be scanned before enough matches for the preview are found.
[7]:
%%time
bright_hi_obs = ztf_object.query("mean_mag_i < 16 and nobs_i > 50")
bright_hi_obs.head(10)
CPU times: user 1.1 s, sys: 40.6 ms, total: 1.14 s
Wall time: 3.61 s
[7]:
| ps1_objid | ra | dec | ps1_gMeanPSFMag | ps1_rMeanPSFMag | ps1_iMeanPSFMag | nobs_g | nobs_r | nobs_i | mean_mag_g | mean_mag_r | mean_mag_i | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | ||||||||||||
| 9746096715954945 | 116280369296843043 | 36.929692 | 6.902155 | 15.3244 | 14.6802 | 14.4398 | 774 | 1001 | 60 | 15.355645 | 14.614888 | 14.384647 |
| 9746152217497820 | 116290369298993007 | 36.929925 | 6.910468 | 14.5683 | 14.1753 | 14.0213 | 774 | 1001 | 59 | 14.590252 | 14.126385 | 13.999373 |
| 9746182134477075 | 116300369318264577 | 36.931842 | 6.920077 | 16.212 | 15.7679 | 15.5924 | 774 | 1001 | 59 | 16.233738 | 15.726957 | 15.561661 |
| 9746646972146367 | 116320368771921345 | 36.877159 | 6.934018 | 15.2244 | 14.5829 | 14.3673 | 776 | 1006 | 60 | 15.237038 | 14.507473 | 14.276531 |
| 9746707367345686 | 116330369022928298 | 36.902303 | 6.9482 | 14.7373 | 14.2589 | 14.1125 | 776 | 1006 | 60 | 14.761919 | 14.222437 | 14.067747 |
| 9746876028689311 | 116370368601585361 | 36.860077 | 6.979021 | 15.9226 | 15.3299 | 15.07 | 776 | 1006 | 59 | 15.955589 | 15.268368 | 15.014759 |
| 9746935734663253 | 116360369123478735 | 36.91243 | 6.973317 | -999.0 | -999.0 | 14.5268 | 776 | 1006 | 60 | 13.256074 | 12.836332 | 12.721764 |
| 9746935734663253 | 116360369124398415 | 36.91243 | 6.973317 | 13.371 | 12.982 | 12.847 | 776 | 1006 | 60 | 13.256074 | 12.836332 | 12.721764 |
| 9747970250983889 | 116450369416576857 | 36.941667 | 7.046998 | 15.4739 | 14.9089 | 14.6781 | 422 | 476 | 59 | 15.497479 | 14.844907 | 14.622186 |
| 9748271495137218 | 116350368300792029 | 36.830091 | 6.959612 | 16.531 | 15.7944 | 15.5202 | 776 | 1005 | 60 | 16.559328 | 15.726777 | 15.455964 |
10 rows × 12 columns
3. Filtering using Python expressions#
In some cases it may be more readable to filter using Python expressions, Pandas-style. In this form, the catalog is indexed with an expression, selecting the rows for which the expression is true. The form of this query is filtered = catalog[expr_with_catalog], where expr_with_catalog needs to evaluate to something which is:

- of the same size as catalog; and
- convertible to boolean.

The expression below produces the same result as the earlier .query() example; whether it is more readable than .query() depends on your expression and what it includes. But there are a couple of fixes we need to make, things which .query() does for you:

- Use & instead of and (and, likewise, | instead of or). The Python logical operators don’t work on column expressions.
- Use ( and ) to ensure the intended precedence of the operators. (& and | are bitwise operators and bind more tightly than comparisons like <, so without parentheses the expression would be grouped incorrectly.)
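Both points are easy to demonstrate with plain pandas (the same operators LSDB relies on), using a toy DataFrame with invented values:

```python
import pandas as pd

df = pd.DataFrame({"mag": [15.0, 17.0], "nobs": [60, 40]})

# Correct: parenthesize each comparison before combining with `&`.
ok = df[(df["mag"] < 16) & (df["nobs"] > 50)]
print(len(ok))  # 1

# Without parentheses, `df["mag"] < 16 & df["nobs"] > 50` parses as
# `df["mag"] < (16 & df["nobs"]) > 50`, a chained comparison that raises
# ValueError ("truth value ... is ambiguous") -- as does using `and`.
```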
Note that the time taken is basically identical to that of the .query() method. There is no particular performance advantage to either approach, as the underlying computations are vectorized in the same way.
[8]:
%%time
bright_ex = ztf_object[(ztf_object["mean_mag_i"] < 16) & (ztf_object["nobs_i"] > 50)]
bright_ex.head()
CPU times: user 1.32 s, sys: 45.7 ms, total: 1.37 s
Wall time: 3.8 s
[8]:
| ps1_objid | ra | dec | ps1_gMeanPSFMag | ps1_rMeanPSFMag | ps1_iMeanPSFMag | nobs_g | nobs_r | nobs_i | mean_mag_g | mean_mag_r | mean_mag_i | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | ||||||||||||
| 9746096715954945 | 116280369296843043 | 36.929692 | 6.902155 | 15.3244 | 14.6802 | 14.4398 | 774 | 1001 | 60 | 15.355645 | 14.614888 | 14.384647 |
| 9746152217497820 | 116290369298993007 | 36.929925 | 6.910468 | 14.5683 | 14.1753 | 14.0213 | 774 | 1001 | 59 | 14.590252 | 14.126385 | 13.999373 |
| 9746182134477075 | 116300369318264577 | 36.931842 | 6.920077 | 16.212 | 15.7679 | 15.5924 | 774 | 1001 | 59 | 16.233738 | 15.726957 | 15.561661 |
| 9746646972146367 | 116320368771921345 | 36.877159 | 6.934018 | 15.2244 | 14.5829 | 14.3673 | 776 | 1006 | 60 | 15.237038 | 14.507473 | 14.276531 |
| 9746707367345686 | 116330369022928298 | 36.902303 | 6.9482 | 14.7373 | 14.2589 | 14.1125 | 776 | 1006 | 60 | 14.761919 | 14.222437 | 14.067747 |
5 rows × 12 columns
Closing the Dask client#
[9]:
client.close()
About#
Authors: Sandro Campos, Melissa DeLucchi, Olivia Lynn, and Derek Jones
Last updated on: April 14, 2025
If you use lsdb for published research, please cite it following these instructions.