Creating a custom search with global statistics#
In this tutorial, we will:
showcase additional statistics available on all catalogs
illustrate how to filter a catalog using those statistics
demonstrate a custom structured search filter class
Introduction#
We’re going to find all stars in the GAIA Bailer-Jones distance catalog that are closer than 3 parsecs away, using the median photogeometric distance estimates.
For numeric extrema, we can use the per_pixel_statistics values, derived from the parquet _metadata file, to perform strong partition pruning: if we want all rows with a value less than k, then a data partition can only contain matches if its minimum value is less than k.
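The pruning principle can be sketched in plain Python (the pixel labels and statistics below are made up for illustration; in practice they come from the parquet _metadata file):

```python
# Hypothetical per-partition minimum values for one column
# (pixel label -> minimum of the column in that partition).
partition_mins = {
    "Order 2, Pixel 0": 7.2,
    "Order 2, Pixel 21": 1.3,
    "Order 4, Pixel 1986": 1.8,
    "Order 3, Pixel 766": 18.3,
}

k = 3  # we want rows with value < k

# A partition can only contain matching rows if its minimum is below k,
# so we only need to scan the partitions that pass this test.
candidates = [pix for pix, min_val in partition_mins.items() if min_val < k]
print(candidates)  # the partitions worth scanning
```

Only the surviving partitions are read and row-filtered; every other partition is skipped without touching its data.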
[1]:
import lsdb
1. Load a catalog and statistics#
First, we load the GAIA Bailer-Jones catalog and check out the per-pixel statistics. These are re-formatted and loaded into a pandas dataframe, for ease of viewing the data and performing filters. We only care about the r_med_photogeo field here, so we’ll add that filter to the statistics loading.
[2]:
gaia_d = lsdb.open_catalog("https://data.lsdb.io/hats/gaia_edr3_distances/gaia_edr3_distances/")
per_pixel = gaia_d.per_pixel_statistics(include_columns=["r_med_photogeo"])
per_pixel
[2]:
| r_med_photogeo: min_value | r_med_photogeo: max_value | r_med_photogeo: null_count | r_med_photogeo: row_count | |
|---|---|---|---|---|
| Order: 2, Pixel: 0 | 7.220708 | 30164.0723 | 4060 | 613898 |
| Order: 3, Pixel: 4 | 9.662493 | 24033.4453 | 1527 | 234484 |
| ... | ... | ... | ... | ... |
| Order: 3, Pixel: 766 | 18.323370 | 44377.1016 | 9376 | 960532 |
| Order: 3, Pixel: 767 | 14.478624 | 44927.0469 | 5648 | 704797 |
3243 rows × 4 columns
2. Find pixels of interest#
We can query these global stats to find just seven partitions that have values at this extreme end. That’s less than half a percent of all data partitions.
[3]:
nearby = per_pixel.query("`r_med_photogeo: min_value` < 3")
limit_pixels = nearby.index.tolist()
limit_pixels
[3]:
[Order: 2, Pixel: 21,
Order: 4, Pixel: 1308,
Order: 2, Pixel: 107,
Order: 5, Pixel: 7238,
Order: 4, Pixel: 1986,
Order: 2, Pixel: 142,
Order: 6, Pixel: 41591]
Next, we combine the steps of performing a PixelSearch over the catalog with a query to find only those rows in the data partitions that meet the criteria. This search is orders of magnitude faster than scanning all of the data partitions!
Note that a lot of the time will still be I/O, as this data set is hosted remotely.
[4]:
gaia_d.pixel_search(pixels=limit_pixels).query("r_med_photogeo < 3").compute()
[4]:
| source_id | r_med_geo | r_lo_geo | r_hi_geo | r_med_photogeo | r_lo_photogeo | r_hi_photogeo | flag | ra | dec | Norder | Dir | Npix | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | |||||||||||||
| 381407739701909405 | 762815470562110464 | 2.545952 | 2.545706 | 2.546205 | 2.545851 | 2.545692 | 2.546106 | 10033 | 165.83096 | 35.948653 | 2 | 0 | 21 |
| 1473525239409412345 | 2947050466531873024 | 2.670147 | 2.668469 | 2.671663 | 2.670632 | 2.668852 | 2.672157 | 10012 | 101.286626 | -16.720933 | 4 | 0 | 1308 |
| 1932486467174911786 | 3864972938605115520 | 2.408358 | 2.407998 | 2.408747 | 2.408285 | 2.407773 | 2.40883 | 10022 | 164.10319 | 7.002727 | 2 | 0 | 107 |
| 2037570889377880190 | 4075141768785646848 | 2.975431 | 2.975131 | 2.975705 | 2.975421 | 2.975177 | 2.975699 | 10033 | 282.458789 | -23.837097 | 5 | 0 | 7238 |
| 2236416425020061922 | 4472832130942575872 | 1.828106 | 1.827975 | 1.828225 | 1.8281 | 1.82797 | 1.828225 | 10033 | 269.448503 | 4.73942 | 4 | 0 | 1986 |
| 2570346799484574619 | 5140693571158739840 | 2.718584 | 2.713234 | 2.724433 | 2.718944 | 2.714043 | 2.724843 | 10022 | 24.771554 | -17.9483 | 2 | 0 | 142 |
| 2570346800245147123 | 5140693571158946048 | 2.67459 | 2.67066 | 2.678184 | 2.675949 | 2.672405 | 2.679119 | 10022 | 24.771674 | -17.947683 | 2 | 0 | 142 |
| 2926749359394994434 | 5853498713190525696 | 1.301911 | 1.301814 | 1.302005 | 1.301935 | 1.301849 | 1.302006 | 10033 | 217.392321 | -62.676075 | 6 | 40000 | 41591 |
8 rows × 13 columns
3. Creating a custom search class#
We can turn these two stages into a custom structured search by subclassing the AbstractSearch class.
The implementation below is probably more elaborate than what you would need, but it provides a very generic search over minimum or maximum values, inclusive or exclusive of the extremum, over any numeric field in the catalog dataset.
__init__ - This is where a lot of the fussy stuff of parameter checking happens.
perform_hc_catalog_filter - This will perform the “coarse” filtering of the underlying HATS catalog structure. Here, we determine which data partitions we want to keep, and create a new HATS structure to match those pixels.
search_points - This will perform the “fine” filtering, when given a full data partition of catalog rows.
[5]:
from lsdb.core.search.abstract_search import AbstractSearch
from lsdb.types import HCCatalogTypeVar
import nested_pandas as npd


class ExtremaSearch(AbstractSearch):
    def __init__(
        self,
        field: str | None = None,
        less_than_value=None,
        greater_than_value=None,
        inclusive=False,
        fine: bool = True,
    ):
        super().__init__(fine)
        if field is None:
            raise ValueError("field is required")
        # Parenthesize the two `is None` tests: without the parentheses,
        # Python chains the comparison and the check silently passes
        # when both values are provided.
        if (less_than_value is None) == (greater_than_value is None):
            raise ValueError("exactly one of less_than_value or greater_than_value is required")
        self.field = field
        self.less_than_value = less_than_value
        self.greater_than_value = greater_than_value
        self.stat = "min_value" if self.less_than_value is not None else "max_value"
        # Use an explicit `is not None` test rather than `or`, so that a
        # threshold of 0 is not mistaken for "no value given".
        stat_value = self.less_than_value if self.less_than_value is not None else self.greater_than_value
        self.stat_clause = (
            f' {"<" if self.less_than_value is not None else ">"}{"=" if inclusive else ""} {stat_value}'
        )

    def perform_hc_catalog_filter(self, hc_structure: HCCatalogTypeVar) -> HCCatalogTypeVar:
        """Determine the pixels for which there is a result in each field"""
        stats = hc_structure.per_pixel_statistics(include_columns=[self.field], include_stats=[self.stat])
        if len(stats) == 0:
            raise ValueError("Error with extrema search - column not found")
        within_extrema = stats.query(f"`{self.field}: {self.stat}`{self.stat_clause}")
        limit_pixels = within_extrema.index.tolist()
        if len(limit_pixels) == 0:
            raise ValueError("Error with extrema search - there's nothing that extreme!!")
        return hc_structure.filter_from_pixel_list(limit_pixels)

    def search_points(self, frame: npd.NestedFrame, _) -> npd.NestedFrame:
        """Determine the search results within a data frame"""
        return frame.query(f"{self.field}{self.stat_clause}")
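The clause-building logic in __init__ can be exercised on its own. The helper below is a hypothetical standalone re-implementation (not part of lsdb) that mirrors the string construction, which makes its behavior easy to check:

```python
def build_stat_clause(less_than_value=None, greater_than_value=None, inclusive=False):
    # Mirror the construction in ExtremaSearch.__init__: pick the
    # comparison operator, optionally make it inclusive, and append
    # the threshold value.
    op = "<" if less_than_value is not None else ">"
    if inclusive:
        op += "="
    value = less_than_value if less_than_value is not None else greater_than_value
    return f" {op} {value}"

print(build_stat_clause(less_than_value=3))                           # " < 3"
print(build_stat_clause(greater_than_value=100_000, inclusive=True))  # " >= 100000"
```

The resulting string is appended to either a backtick-quoted statistics column name (for the coarse filter) or the plain field name (for the fine filter), giving a valid pandas query expression in both cases.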
4. Issuing the search#
The effort of building the search class pays off when we want to issue several of these searches, using different fields with different extrema values.
[6]:
search_object = ExtremaSearch(field="r_med_photogeo", less_than_value=3)
gaia_d.search(search_object).compute()
[6]:
| source_id | r_med_geo | r_lo_geo | r_hi_geo | r_med_photogeo | r_lo_photogeo | r_hi_photogeo | flag | ra | dec | Norder | Dir | Npix | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | |||||||||||||
| 381407739701909405 | 762815470562110464 | 2.545952 | 2.545706 | 2.546205 | 2.545851 | 2.545692 | 2.546106 | 10033 | 165.83096 | 35.948653 | 2 | 0 | 21 |
| 1473525239409412345 | 2947050466531873024 | 2.670147 | 2.668469 | 2.671663 | 2.670632 | 2.668852 | 2.672157 | 10012 | 101.286626 | -16.720933 | 4 | 0 | 1308 |
| 1932486467174911786 | 3864972938605115520 | 2.408358 | 2.407998 | 2.408747 | 2.408285 | 2.407773 | 2.40883 | 10022 | 164.10319 | 7.002727 | 2 | 0 | 107 |
| 2037570889377880190 | 4075141768785646848 | 2.975431 | 2.975131 | 2.975705 | 2.975421 | 2.975177 | 2.975699 | 10033 | 282.458789 | -23.837097 | 5 | 0 | 7238 |
| 2236416425020061922 | 4472832130942575872 | 1.828106 | 1.827975 | 1.828225 | 1.8281 | 1.82797 | 1.828225 | 10033 | 269.448503 | 4.73942 | 4 | 0 | 1986 |
| 2570346799484574619 | 5140693571158739840 | 2.718584 | 2.713234 | 2.724433 | 2.718944 | 2.714043 | 2.724843 | 10022 | 24.771554 | -17.9483 | 2 | 0 | 142 |
| 2570346800245147123 | 5140693571158946048 | 2.67459 | 2.67066 | 2.678184 | 2.675949 | 2.672405 | 2.679119 | 10022 | 24.771674 | -17.947683 | 2 | 0 | 142 |
| 2926749359394994434 | 5853498713190525696 | 1.301911 | 1.301814 | 1.302005 | 1.301935 | 1.301849 | 1.302006 | 10033 | 217.392321 | -62.676075 | 6 | 40000 | 41591 |
8 rows × 13 columns
[7]:
search_object = ExtremaSearch(field="r_hi_geo", greater_than_value=100_000)
gaia_d.search(search_object).compute()
[7]:
| source_id | r_med_geo | r_lo_geo | r_hi_geo | r_med_photogeo | r_lo_photogeo | r_hi_photogeo | flag | ra | dec | Norder | Dir | Npix | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _healpix_29 | |||||||||||||
| 2319134127658377825 | 4638268223858551296 | 71831.4062 | 44965.7852 | 124300.227 | 39193.8711 | 32700.9375 | 48062.2188 | 10033 | 27.586025 | -74.355689 | 3 | 0 | 514 |
| 2319140599809555019 | 4638281181775861376 | 76911.8594 | 56610.3945 | 105344.438 | 21621.834 | 16319.9121 | 27800.6914 | 10022 | 28.218428 | -74.096348 | 3 | 0 | 514 |
| 2329478941386825612 | 4658957875932644352 | 46664.0547 | 33256.7617 | 121515.211 | 33280.8242 | 26262.4434 | 41199.6523 | 10022 | 79.065792 | -66.899222 | 5 | 0 | 8275 |
| 2342986087888935203 | 4685972155110368000 | 48608.668 | 20563.998 | 101465.523 | 10240.5146 | 9428.91309 | 11350.5986 | 11033 | 15.143423 | -72.742918 | 6 | 30000 | 33295 |
| 2342995861826775384 | 4685991705804612992 | 125834.633 | 73659.1562 | 144939.844 | 60050.5117 | 38000.2656 | 72400.2969 | 10033 | 14.587783 | -72.584025 | 6 | 30000 | 33295 |
5 rows × 13 columns
About#
Authors: Melissa DeLucchi
Last updated/verified on: October 27, 2025
If you use lsdb for published research, please cite it following these instructions.