Creating a custom search with global statistics#
In this tutorial, we will:
showcase additional statistics available on all catalogs
illustrate how to filter a catalog using those statistics
demonstrate a custom structured search filter class
Introduction#
We’re going to find all stars in the GAIA Bailer-Jones distance catalog that are closer than 3 parsecs away, using the median photogeometric distance estimates.
For numeric extrema, we can use the per_pixel_statistics values, derived from the parquet _metadata file, to perform strong partition pruning: if we want all rows with a value less than k, then a data partition can only contain matches if its minimum value is less than k.
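The pruning principle can be sketched in plain Python (the pixel labels and statistics below are made up for illustration; in practice they come from the parquet _metadata file):

```python
# Hypothetical per-partition minimum values for one column
# (pixel label -> minimum of the column in that partition).
partition_mins = {
    "Order 2, Pixel 0": 7.2,
    "Order 2, Pixel 21": 1.3,
    "Order 4, Pixel 1986": 1.8,
    "Order 3, Pixel 766": 18.3,
}

k = 3  # we want rows with value < k

# A partition can only contain matching rows if its minimum is below k,
# so we only need to scan the partitions that pass this test.
candidates = [pix for pix, min_val in partition_mins.items() if min_val < k]
print(candidates)  # the partitions worth scanning
```

Only the surviving partitions are read and row-filtered; every other partition is skipped without touching its data.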
[1]:
import lsdb
1. Load a catalog and statistics#
First, we load the GAIA Bailer-Jones catalog and check out the per-pixel statistics. These are re-formatted and loaded into a pandas dataframe, for ease of viewing the data and performing filters. We only care about the r_med_photogeo field here, so we’ll add that filter to the statistics loading.
[2]:
gaia_d = lsdb.open_catalog("https://data.lsdb.io/hats/gaia_edr3_distances/gaia_edr3_distances/")
per_pixel = gaia_d.per_pixel_statistics(include_columns=["r_med_photogeo"])
per_pixel
[2]:
| r_med_photogeo: min_value | r_med_photogeo: max_value | r_med_photogeo: null_count | r_med_photogeo: row_count | |
|---|---|---|---|---|
| Order: 2, Pixel: 0 | 7.220708 | 30164.0723 | 4060 | 613898 |
| Order: 3, Pixel: 4 | 9.662493 | 24033.4453 | 1527 | 234484 |
| ... | ... | ... | ... | ... |
| Order: 3, Pixel: 766 | 18.323370 | 44377.1016 | 9376 | 960532 |
| Order: 3, Pixel: 767 | 14.478624 | 44927.0469 | 5648 | 704797 |
3243 rows × 4 columns
2. Find pixels of interest#
We can query these global stats to find just seven partitions that have values at this extreme end. That’s less than half a percent of all data partitions.
[3]:
nearby = per_pixel.query("`r_med_photogeo: min_value` < 3")
limit_pixels = nearby.index.tolist()
limit_pixels
[3]:
[Order: 2, Pixel: 21,
Order: 4, Pixel: 1308,
Order: 2, Pixel: 107,
Order: 5, Pixel: 7238,
Order: 4, Pixel: 1986,
Order: 2, Pixel: 142,
Order: 6, Pixel: 41591]
Next, we combine the steps of performing a PixelSearch over the catalog with a query to find only those rows in the data partitions that meet the criteria. This search is orders of magnitude faster than scanning all of the data partitions!
Note that a lot of the time will still be I/O, as this data set is hosted remotely.
[4]:
gaia_d.pixel_search(pixels=limit_pixels).query("r_med_photogeo < 3").compute()
[4]:
| source_id | r_med_geo | r_lo_geo | r_hi_geo | r_med_photogeo | r_lo_photogeo | r_hi_photogeo | flag | ra | dec | Norder | Dir | Npix | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | |||||||||||||
| 381407739701909405 | 762815470562110464 | 2.545952 | 2.545706 | 2.546205 | 2.545851 | 2.545692 | 2.546106 | 10033 | 165.83096 | 35.948653 | 2 | 0 | 21 |
| 1473525239409412345 | 2947050466531873024 | 2.670147 | 2.668469 | 2.671663 | 2.670632 | 2.668852 | 2.672157 | 10012 | 101.286626 | -16.720933 | 4 | 0 | 1308 |
| 1932486467174911786 | 3864972938605115520 | 2.408358 | 2.407998 | 2.408747 | 2.408285 | 2.407773 | 2.40883 | 10022 | 164.10319 | 7.002727 | 2 | 0 | 107 |
| 2037570889377880190 | 4075141768785646848 | 2.975431 | 2.975131 | 2.975705 | 2.975421 | 2.975177 | 2.975699 | 10033 | 282.458789 | -23.837097 | 5 | 0 | 7238 |
| 2236416425020061922 | 4472832130942575872 | 1.828106 | 1.827975 | 1.828225 | 1.8281 | 1.82797 | 1.828225 | 10033 | 269.448503 | 4.73942 | 4 | 0 | 1986 |
| 2570346799484574619 | 5140693571158739840 | 2.718584 | 2.713234 | 2.724433 | 2.718944 | 2.714043 | 2.724843 | 10022 | 24.771554 | -17.9483 | 2 | 0 | 142 |
| 2570346800245147123 | 5140693571158946048 | 2.67459 | 2.67066 | 2.678184 | 2.675949 | 2.672405 | 2.679119 | 10022 | 24.771674 | -17.947683 | 2 | 0 | 142 |
| 2926749359394994434 | 5853498713190525696 | 1.301911 | 1.301814 | 1.302005 | 1.301935 | 1.301849 | 1.302006 | 10033 | 217.392321 | -62.676075 | 6 | 40000 | 41591 |
8 rows × 13 columns
3. Creating a custom search class#
We can turn these two stages into a custom structured search by subclassing the AbstractSearch class.
The implementation below is probably more elaborate than what you would need, but it provides a very generic search over minimum or maximum values, inclusive or exclusive of the extremum, over any numeric field in the catalog dataset.
__init__ - This is where a lot of the fussy stuff of parameter checking happens.
perform_hc_catalog_filter - This will perform the “coarse” filtering of the underlying HATS catalog structure. Here, we determine which data partitions we want to keep, and create a new HATS structure to match those pixels.
search_points - This will perform the “fine” filtering, when given a full data partition of catalog rows.
[5]:
from lsdb.core.search.abstract_search import AbstractSearch
from lsdb.types import HCCatalogTypeVar
import nested_pandas as npd


class ExtremaSearch(AbstractSearch):
    def __init__(
        self,
        field: str | None = None,
        less_than_value=None,
        greater_than_value=None,
        inclusive=False,
        fine: bool = True,
    ):
        super().__init__(fine)
        if field is None:
            raise ValueError("field is required")
        # Parenthesize the two `is None` tests: without the parentheses,
        # Python chains the comparison and the check silently passes
        # when both values are provided.
        if (less_than_value is None) == (greater_than_value is None):
            raise ValueError("exactly one of less_than_value or greater_than_value is required")
        self.field = field
        self.less_than_value = less_than_value
        self.greater_than_value = greater_than_value
        self.stat = "min_value" if self.less_than_value is not None else "max_value"
        # Use an explicit `is not None` test rather than `or`, so that a
        # threshold of 0 is not mistaken for "no value given".
        stat_value = self.less_than_value if self.less_than_value is not None else self.greater_than_value
        self.stat_clause = (
            f' {"<" if self.less_than_value is not None else ">"}{"=" if inclusive else ""} {stat_value}'
        )

    def perform_hc_catalog_filter(self, hc_structure: HCCatalogTypeVar) -> HCCatalogTypeVar:
        """Determine the pixels for which there is a result in each field"""
        stats = hc_structure.per_pixel_statistics(include_columns=[self.field], include_stats=[self.stat])
        if len(stats) == 0:
            raise ValueError("Error with extrema search - column not found")
        within_extrema = stats.query(f"`{self.field}: {self.stat}`{self.stat_clause}")
        limit_pixels = within_extrema.index.tolist()
        if len(limit_pixels) == 0:
            raise ValueError("Error with extrema search - there's nothing that extreme!!")
        return hc_structure.filter_from_pixel_list(limit_pixels)

    def search_points(self, frame: npd.NestedFrame, _) -> npd.NestedFrame:
        """Determine the search results within a data frame"""
        return frame.query(f"{self.field}{self.stat_clause}")
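The clause-building logic in __init__ can be exercised on its own. The helper below is a hypothetical standalone re-implementation (not part of lsdb) that mirrors the string construction, which makes its behavior easy to check:

```python
def build_stat_clause(less_than_value=None, greater_than_value=None, inclusive=False):
    # Mirror the construction in ExtremaSearch.__init__: pick the
    # comparison operator, optionally make it inclusive, and append
    # the threshold value.
    op = "<" if less_than_value is not None else ">"
    if inclusive:
        op += "="
    value = less_than_value if less_than_value is not None else greater_than_value
    return f" {op} {value}"

print(build_stat_clause(less_than_value=3))                           # " < 3"
print(build_stat_clause(greater_than_value=100_000, inclusive=True))  # " >= 100000"
```

The resulting string is appended to either a backtick-quoted statistics column name (for the coarse filter) or the plain field name (for the fine filter), giving a valid pandas query expression in both cases.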
4. Issuing the search#
The effort of building the search class pays off when we want to issue several of these searches, using different fields with different extrema values.
[6]:
search_object = ExtremaSearch(field="r_med_photogeo", less_than_value=3)
gaia_d.search(search_object).compute()
[6]:
| source_id | r_med_geo | r_lo_geo | r_hi_geo | r_med_photogeo | r_lo_photogeo | r_hi_photogeo | flag | ra | dec | Norder | Dir | Npix | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | |||||||||||||
| 381407739701909405 | 762815470562110464 | 2.545952 | 2.545706 | 2.546205 | 2.545851 | 2.545692 | 2.546106 | 10033 | 165.83096 | 35.948653 | 2 | 0 | 21 |
| 1473525239409412345 | 2947050466531873024 | 2.670147 | 2.668469 | 2.671663 | 2.670632 | 2.668852 | 2.672157 | 10012 | 101.286626 | -16.720933 | 4 | 0 | 1308 |
| 1932486467174911786 | 3864972938605115520 | 2.408358 | 2.407998 | 2.408747 | 2.408285 | 2.407773 | 2.40883 | 10022 | 164.10319 | 7.002727 | 2 | 0 | 107 |
| 2037570889377880190 | 4075141768785646848 | 2.975431 | 2.975131 | 2.975705 | 2.975421 | 2.975177 | 2.975699 | 10033 | 282.458789 | -23.837097 | 5 | 0 | 7238 |
| 2236416425020061922 | 4472832130942575872 | 1.828106 | 1.827975 | 1.828225 | 1.8281 | 1.82797 | 1.828225 | 10033 | 269.448503 | 4.73942 | 4 | 0 | 1986 |
| 2570346799484574619 | 5140693571158739840 | 2.718584 | 2.713234 | 2.724433 | 2.718944 | 2.714043 | 2.724843 | 10022 | 24.771554 | -17.9483 | 2 | 0 | 142 |
| 2570346800245147123 | 5140693571158946048 | 2.67459 | 2.67066 | 2.678184 | 2.675949 | 2.672405 | 2.679119 | 10022 | 24.771674 | -17.947683 | 2 | 0 | 142 |
| 2926749359394994434 | 5853498713190525696 | 1.301911 | 1.301814 | 1.302005 | 1.301935 | 1.301849 | 1.302006 | 10033 | 217.392321 | -62.676075 | 6 | 40000 | 41591 |
8 rows × 13 columns
[7]:
search_object = ExtremaSearch(field="r_hi_geo", greater_than_value=100_000)
gaia_d.search(search_object).compute()
[7]:
| source_id | r_med_geo | r_lo_geo | r_hi_geo | r_med_photogeo | r_lo_photogeo | r_hi_photogeo | flag | ra | dec | Norder | Dir | Npix | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | |||||||||||||
| 2319134127658377825 | 4638268223858551296 | 71831.4062 | 44965.7852 | 124300.227 | 39193.8711 | 32700.9375 | 48062.2188 | 10033 | 27.586025 | -74.355689 | 3 | 0 | 514 |
| 2319140599809555019 | 4638281181775861376 | 76911.8594 | 56610.3945 | 105344.438 | 21621.834 | 16319.9121 | 27800.6914 | 10022 | 28.218428 | -74.096348 | 3 | 0 | 514 |
| 2329478941386825612 | 4658957875932644352 | 46664.0547 | 33256.7617 | 121515.211 | 33280.8242 | 26262.4434 | 41199.6523 | 10022 | 79.065792 | -66.899222 | 5 | 0 | 8275 |
| 2342986087888935203 | 4685972155110368000 | 48608.668 | 20563.998 | 101465.523 | 10240.5146 | 9428.91309 | 11350.5986 | 11033 | 15.143423 | -72.742918 | 6 | 30000 | 33295 |
| 2342995861826775384 | 4685991705804612992 | 125834.633 | 73659.1562 | 144939.844 | 60050.5117 | 38000.2656 | 72400.2969 | 10033 | 14.587783 | -72.584025 | 6 | 30000 | 33295 |
5 rows × 13 columns
About#
Authors: Melissa DeLucchi
Last updated/verified on: October 27, 2025
If you use lsdb for published research, please cite it following these instructions.