Creating a custom search with global statistics#

In this tutorial, we will:

  • showcase additional statistics available on all catalogs

  • illustrate how to filter a catalog using those statistics

  • demonstrate a custom structured search filter class

Introduction#

We’re going to find all stars in GAIA Bailor-Jones that are closer than 3 parsecs away, using median photogeometric distance estimates.

For numeric extrema, we can use the per_pixel_statistics values, derived from the parquet _metadata file to perform some pretty strong filtering: if we want all rows with value less than k, then the minimum value in the data partition must be less than k.

[1]:
import lsdb

1. Load a catalog and statistics#

First, we load the GAIA Bailor-Jones catalog, and check out the per-pixel statistics. These are re-formatted and loaded into a pandas dataframe, for ease of viewing the data and performing filters. We only care about the r_med_photogeo field for this, so we’ll add that filter to the statistics loading.

[2]:
gaia_d = lsdb.open_catalog("https://data.lsdb.io/hats/gaia_edr3_distances/gaia_edr3_distances/")
per_pixel = gaia_d.per_pixel_statistics(include_columns=["r_med_photogeo"])
per_pixel
[2]:
r_med_photogeo: min_value r_med_photogeo: max_value r_med_photogeo: null_count r_med_photogeo: row_count
Order: 2, Pixel: 0 7.220708 30164.0723 4060 613898
Order: 3, Pixel: 4 9.662493 24033.4453 1527 234484
... ... ... ... ...
Order: 3, Pixel: 766 18.323370 44377.1016 9376 960532
Order: 3, Pixel: 767 14.478624 44927.0469 5648 704797

3243 rows × 4 columns

2. Find pixels of interest#

We can query these global stats to find just SEVEN partitions that have values in this extreme end. That’s less than half a percent of all data partitions.

[3]:
nearby = per_pixel.query("`r_med_photogeo: min_value` < 3")
limit_pixels = nearby.index.tolist()
limit_pixels
[3]:
[Order: 2, Pixel: 21,
 Order: 4, Pixel: 1308,
 Order: 2, Pixel: 107,
 Order: 5, Pixel: 7238,
 Order: 4, Pixel: 1986,
 Order: 2, Pixel: 142,
 Order: 6, Pixel: 41591]

Next, we combine the steps of performing a PixelSearch over the catalog with a query to find only those rows in the data partitions that meet the criteria. This search is orders of magnitude faster than scanning all of the data partitions!

Note that a lot of the time will still be I/O, as this data set is hosted remotely.

[4]:
gaia_d.pixel_search(pixels=limit_pixels).query("r_med_photogeo < 3").compute()
[4]:
source_id r_med_geo r_lo_geo r_hi_geo r_med_photogeo r_lo_photogeo r_hi_photogeo flag ra dec Norder Dir Npix
_healpix_29
381407739701909405 762815470562110464 2.545952 2.545706 2.546205 2.545851 2.545692 2.546106 10033 165.83096 35.948653 2 0 21
1473525239409412345 2947050466531873024 2.670147 2.668469 2.671663 2.670632 2.668852 2.672157 10012 101.286626 -16.720933 4 0 1308
1932486467174911786 3864972938605115520 2.408358 2.407998 2.408747 2.408285 2.407773 2.40883 10022 164.10319 7.002727 2 0 107
2037570889377880190 4075141768785646848 2.975431 2.975131 2.975705 2.975421 2.975177 2.975699 10033 282.458789 -23.837097 5 0 7238
2236416425020061922 4472832130942575872 1.828106 1.827975 1.828225 1.8281 1.82797 1.828225 10033 269.448503 4.73942 4 0 1986
2570346799484574619 5140693571158739840 2.718584 2.713234 2.724433 2.718944 2.714043 2.724843 10022 24.771554 -17.9483 2 0 142
2570346800245147123 5140693571158946048 2.67459 2.67066 2.678184 2.675949 2.672405 2.679119 10022 24.771674 -17.947683 2 0 142
2926749359394994434 5853498713190525696 1.301911 1.301814 1.302005 1.301935 1.301849 1.302006 10033 217.392321 -62.676075 6 40000 41591

8 rows × 13 columns

3. Creating a custom search class#

We can turn these two stages into a custom structured search by implementing the AbstractSearch class.

The implementation below is probably more finicky than what you might implement, but it provides a very generic search over minimum or maximum values, inclusive or exclusive of the extrema value, over any field in the catalog dataset.

  1. __init__ - This is where a lot of the fussy stuff of parameter checking happens

  2. perform_hc_catalog_filter - This will perform the “coarse” filtering of the underlying HATS catalog structure. Here, we determine which data partitions we want to keep, and create a new HATS structure to match those pixels.

  3. search_points - This will perform the “fine” filtering, when given a full data partition of catalog rows.

[5]:
from lsdb.core.search.abstract_search import AbstractSearch
from lsdb.types import HCCatalogTypeVar
import nested_pandas as npd


class ExtremaSearch(AbstractSearch):
    def __init__(
        self,
        field: str = None,
        less_than_value=None,
        greater_than_value=None,
        inclusive=False,
        fine: bool = True,
    ):
        super().__init__(fine)
        if field is None:
            raise ValueError("field is required")
        if less_than_value is None == greater_than_value is None:
            raise ValueError("exactly one of less_than_value or greater_than_value is required")
        self.field = field
        self.less_than_value = less_than_value
        self.greater_than_value = greater_than_value
        self.stat = "min_value" if self.less_than_value is not None else "max_value"
        stat_value = self.less_than_value or self.greater_than_value
        self.stat_clause = (
            f' {"<" if self.less_than_value is not None else ">"}{"=" if inclusive else ""} {stat_value}'
        )

    def perform_hc_catalog_filter(self, hc_structure: HCCatalogTypeVar) -> HCCatalogTypeVar:
        """Determine the pixels for which there is a result in each field"""
        stats = hc_structure.per_pixel_statistics(include_columns=[self.field], include_stats=[self.stat])
        if len(stats) == 0:
            raise ValueError("Error with extrema search - column not found")

        within_extrema = stats.query(f"`{self.field}: {self.stat}` {self.stat_clause}")
        limit_pixels = within_extrema.index.tolist()
        if len(limit_pixels) == 0:
            raise ValueError("Error with extrema search - there's nothing that extreme!!")
        return hc_structure.filter_from_pixel_list(limit_pixels)

    def search_points(self, frame: npd.NestedFrame, _) -> npd.NestedFrame:
        """Determine the search results within a data frame"""
        return frame.query(f"{self.field}{self.stat_clause}")

About#

Authors: Melissa DeLucchi

Last updated /verified on: October 27, 2025

If you use lsdb for published research, please cite following instructions.