Row Filtering#

In this tutorial, we will:

  • set up a Dask client and open an object catalog

  • filter rows of data using expressions involving column values

  • preview catalog data and query results

Introduction#

When a catalog is opened, it is available for operations. However, its data is lazily loaded, and operations on it remain unrealized until computation is requested explicitly (using the .compute() method) or implicitly, via data preview functions.
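This lazy-evaluation pattern can be illustrated with plain Dask, independent of LSDB (a minimal sketch; the `double` function is made up for illustration):

```python
import dask


@dask.delayed
def double(x):
    # Calling a delayed function only builds a task-graph node; nothing runs yet.
    return 2 * x


lazy = double(21)        # an unrealized Delayed object
result = lazy.compute()  # computation happens only here
```

LSDB catalogs behave analogously: filters and column operations build up a graph that is only executed on `.compute()` or a preview call.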

[1]:
import lsdb
from dask.distributed import Client

1. Open a catalog#

We create a basic Dask client, and open an existing HATS catalog: the ZTF DR14 catalog.

Additional Help

For additional information on Dask client creation, please refer to the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Note that Dask also provides its own best practices, which may be useful to consult.

For tips on accessing remote data, see our Accessing remote data guide.

Create a basic Dask client, limiting the number of workers. This keeps subsequent operations from using more of our compute resources than we might intend, which is helpful in any case but especially when working on a shared resource.

[2]:
client = Client(n_workers=4, memory_limit="auto")
client
[2]:

Client: Client-523280ce-232f-11f1-96f6-8e6cf3dc31f8
Connection method: Cluster object; Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

[3]:
ztf_object_path = "https://data.lsdb.io/hats/ztf_dr14/ztf_object"
ztf_object = lsdb.open_catalog(ztf_object_path)
ztf_object
[3]:
lsdb Catalog ztf_dr14:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
npartitions=2352
Order: 3, Pixel: 0 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] int32[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 3, Pixel: 1 ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3070 ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3071 ... ... ... ... ... ... ... ... ... ... ... ...
12 out of 15 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

1.1. Previewing part of the data#

Computing an entire catalog requires loading all of its resulting data into memory, which is expensive and may lead to out-of-memory issues.

Often, our goal is to peek at a slice of the data to make sure the workflow output is reasonable (e.g., to check that newly created columns are present and their values have been properly processed). head() is a Pandas-like method which allows us to preview part of the data for this purpose. It iterates over the existing catalog partitions in sequence, returning up to n rows from the first partition(s) able to supply them. Related methods include .tail() and .sample().

There is also .random_sample(), but that method fetches rows from many partitions (rather than from the first qualifying ones), so it can be much more expensive, even if it may be more representative.

Notice that all these previewing methods implicitly call compute(), and will implicitly use the Client we created earlier.
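The semantics of these preview methods mirror their pandas counterparts, sketched here on a plain DataFrame (the catalog versions behave the same way, but pull rows from partitions lazily; the `mag` column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"mag": range(100)})

first5 = df.head()                      # first n rows (n defaults to 5)
last3 = df.tail(3)                      # last 3 rows
sampled = df.sample(4, random_state=0)  # 4 rows drawn at random
```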

[4]:
ztf_object.head()
[4]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
3122680427 108000449961107213 44.996152 0.005626 18.1133 17.6278 17.4065 364 381 18 18.143208 17.569102 17.373423
24532305457 108020450140974134 45.01411 0.019775 21.565901 21.134199 21.414301 42 114 1 21.298568 21.007693 20.599543
26093200103 108010450069569645 45.007042 0.015954 22.047899 21.641199 21.655001 5 17 0 21.552179 21.405765 <NA>
29154369144 108020450049144323 45.004993 0.01991 14.8925 14.1057 13.7921 371 387 18 14.916993 14.028394 13.727368
29640602698 108020450042525676 45.004327 0.021077 19.278601 18.040701 16.853201 126 302 15 19.34495 17.897595 16.653296

5 rows × 12 columns

2. Selecting data rows by querying column values#

We can filter by column values via query().

The expression in the string given to .query() follows the same syntax accepted by Pandas’ .query(), which supports a subset of Python expressions for filtering DataFrames.

Column names that are not valid Python variable names should be wrapped in backticks, and variable values can be injected using f-strings. The use of ‘@’ to reference variables is not supported.

More information about Pandas query strings is available here.
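A small pandas sketch of these rules (the column names and the `limit` variable are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "mean_mag_i": [15.2, 17.8, 14.1],
    "nobs i": [60, 10, 3],  # hypothetical name that is not a valid identifier
})

limit = 16
# Backticks wrap the invalid column name; the f-string injects limit's value.
bright = df.query(f"mean_mag_i < {limit} and `nobs i` > 5")
```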

In the following query, we want to find objects in the catalog whose magnitude is brighter than 16:

[5]:
bright = ztf_object.query("mean_mag_i < 16")
bright
[5]:
lsdb Catalog ztf_dr14:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
npartitions=2352
Order: 3, Pixel: 0 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] int32[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 3, Pixel: 1 ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3070 ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3071 ... ... ... ... ... ... ... ... ... ... ... ...
12 out of 15 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

We’ll use .head() for a quick sanity check to be sure that no mean_mag_i is dimmer than 16. Since it’s only a few rows, it’s not a guarantee, but it does help us to be sure that we didn’t make any obvious mistake with our query expression.

[6]:
%%time
bright.head(10)
CPU times: user 448 ms, sys: 15.9 ms, total: 464 ms
Wall time: 1.33 s
[6]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
29154369144 108020450049144323 45.004993 0.01991 14.8925 14.1057 13.7921 371 387 18 14.916993 14.028394 13.727368
282950614202 108050450482618413 45.048268 0.048254 15.6769 14.9692 14.6613 381 391 18 15.716422 14.905342 14.602479
643735899268 108090449932592079 44.993251 0.076349 16.728901 16.2297 16.0023 381 393 18 16.760177 16.154505 15.969701
1429973025325 108160451321075838 45.132135 0.137899 12.935 12.51 12.338 381 0 18 12.7391 <NA> 12.192645
1476436129689 108180451646805881 45.164636 0.154514 15.7359 15.2979 15.1167 381 393 18 15.755066 15.243087 15.091201
1531897147170 108170451166024104 45.116649 0.144748 15.6025 15.1791 14.9971 381 393 18 15.615784 15.13288 14.971479
1668455889710 108150450581393374 45.058158 0.127411 13.2085 12.777 12.397 381 340 18 13.234258 12.412657 12.14209
1687311633264 108150450336704808 45.033569 0.128532 15.2136 14.5623 14.2601 381 393 18 15.232634 14.489734 14.195812
1755872301127 108170450559215794 45.055946 0.146135 15.6313 15.1561 14.9387 381 393 18 15.665383 15.094206 14.909868
2123226198499 108230450634635231 45.063462 0.195674 16.121099 15.688 15.5198 359 374 18 16.139978 15.637057 15.509201

10 rows × 12 columns

You can use parentheses, logical operators, and more than one column name in your expressions. Here, we alter the query to include not only those objects with a mean_mag_i brighter than 16, but which also have at least 50 observations in that band. Note that this query takes longer than the original, mostly because more partitions must be scanned before enough rows satisfy the stricter condition.

[7]:
%%time
bright_hi_obs = ztf_object.query("mean_mag_i < 16 and nobs_i > 50")
bright_hi_obs.head(10)
CPU times: user 1.1 s, sys: 40.6 ms, total: 1.14 s
Wall time: 3.61 s
[7]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
9746096715954945 116280369296843043 36.929692 6.902155 15.3244 14.6802 14.4398 774 1001 60 15.355645 14.614888 14.384647
9746152217497820 116290369298993007 36.929925 6.910468 14.5683 14.1753 14.0213 774 1001 59 14.590252 14.126385 13.999373
9746182134477075 116300369318264577 36.931842 6.920077 16.212 15.7679 15.5924 774 1001 59 16.233738 15.726957 15.561661
9746646972146367 116320368771921345 36.877159 6.934018 15.2244 14.5829 14.3673 776 1006 60 15.237038 14.507473 14.276531
9746707367345686 116330369022928298 36.902303 6.9482 14.7373 14.2589 14.1125 776 1006 60 14.761919 14.222437 14.067747
9746876028689311 116370368601585361 36.860077 6.979021 15.9226 15.3299 15.07 776 1006 59 15.955589 15.268368 15.014759
9746935734663253 116360369123478735 36.91243 6.973317 -999.0 -999.0 14.5268 776 1006 60 13.256074 12.836332 12.721764
9746935734663253 116360369124398415 36.91243 6.973317 13.371 12.982 12.847 776 1006 60 13.256074 12.836332 12.721764
9747970250983889 116450369416576857 36.941667 7.046998 15.4739 14.9089 14.6781 422 476 59 15.497479 14.844907 14.622186
9748271495137218 116350368300792029 36.830091 6.959612 16.531 15.7944 15.5202 776 1005 60 16.559328 15.726777 15.455964

10 rows × 12 columns

3. Filtering using Python expressions#

In some cases it may be more readable to query using Python expressions, Pandas-style. In this form, the catalog is indexed using an expression, selecting the rows for which the expression is true. The form of this query is filtered = collection[expr_with_collection], where expr_with_collection needs to evaluate to something which is:

  • of the same size as collection; and

  • convertible to boolean

The expression below produces the same result as the earlier .query() example; whether it is more readable than .query() depends on your expression and what it includes. But there are a couple of differences to account for, things which .query() handles for you:

  • The use of & instead of and (also, | instead of or). The Python logical keywords don’t work here.

  • Having to use ( and ) to ensure the intended precedence of the operators. (& and | are bitwise operators and bind more tightly than comparison operators such as <, so without parentheses the comparisons would be grouped incorrectly.)
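The precedence pitfall can be seen on a plain DataFrame (a toy sketch; the columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"mag": [15.0, 17.0], "nobs": [60, 10]})

# Parenthesized: each comparison yields a boolean Series, combined elementwise.
good = df[(df["mag"] < 16) & (df["nobs"] > 50)]

# Unparenthesized, `16 & df["nobs"]` is evaluated first, and the resulting
# chained comparison tries to take the truth value of a Series -> ValueError.
try:
    df[df["mag"] < 16 & df["nobs"] > 50]
    misparse_raised = False
except ValueError:
    misparse_raised = True
```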

Note that the time taken is essentially identical to that of the .query() method. There is no particular performance advantage to either approach, as the underlying computations are vectorized in the same way.

[8]:
%%time
bright_ex = ztf_object[(ztf_object["mean_mag_i"] < 16) & (ztf_object["nobs_i"] > 50)]
bright_ex.head()
CPU times: user 1.32 s, sys: 45.7 ms, total: 1.37 s
Wall time: 3.8 s
[8]:
ps1_objid ra dec ps1_gMeanPSFMag ps1_rMeanPSFMag ps1_iMeanPSFMag nobs_g nobs_r nobs_i mean_mag_g mean_mag_r mean_mag_i
_healpix_29
9746096715954945 116280369296843043 36.929692 6.902155 15.3244 14.6802 14.4398 774 1001 60 15.355645 14.614888 14.384647
9746152217497820 116290369298993007 36.929925 6.910468 14.5683 14.1753 14.0213 774 1001 59 14.590252 14.126385 13.999373
9746182134477075 116300369318264577 36.931842 6.920077 16.212 15.7679 15.5924 774 1001 59 16.233738 15.726957 15.561661
9746646972146367 116320368771921345 36.877159 6.934018 15.2244 14.5829 14.3673 776 1006 60 15.237038 14.507473 14.276531
9746707367345686 116330369022928298 36.902303 6.9482 14.7373 14.2589 14.1125 776 1006 60 14.761919 14.222437 14.067747

5 rows × 12 columns

Closing the Dask client#

[9]:
client.close()

About#

Authors: Sandro Campos, Melissa DeLucchi, Olivia Lynn, and Derek Jones

Last updated on: April 14, 2025

If you use lsdb for published research, please cite it following these instructions.