Manual catalog verification#

This notebook demonstrates how to verify that a directory contains a valid HATS catalog, both with the built-in validation method and by manually inspecting the catalog metadata and contents.

1. Directory verification#

The HATS library provides a method to verify that a directory contains the appropriate metadata files.

The validation comes in a few flavors, and the quickest one doesn’t take any additional flags:

[1]:
from hats.io.validation import is_valid_catalog
import hats
from upath import UPath

gaia_catalog_path = UPath("https://data.lsdb.io/hats/gaia_dr3/gaia/")
is_valid_catalog(gaia_catalog_path)
[1]:
True

1.1. Explaining the input and output#

The strict argument takes us through a different code path that rigorously tests the contents of all ancillary metadata files and the consistency of the partition pixels.

Here, we use the verbose=True argument to print out a little bit more information about our catalog. It will repeat the path that we’re looking at, display the total number of partitions, and calculate the approximate sky coverage, based on the area of the HATS tiles.

The fail_fast argument will determine if we break out of the method at the first sign of trouble or keep looking for validation problems. This can be useful if you’re debugging multiple points of failure in a catalog.

[2]:
is_valid_catalog(gaia_catalog_path, verbose=True, fail_fast=False, strict=True)
Validating catalog at path https://data.lsdb.io/hats/gaia_dr3/gaia/ ...
Found 2016 partitions.
Approximate coverage is 100.00 % of the sky.
[2]:
True
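The coverage figure follows from HEALPix geometry: at order k the sphere is divided into 12 · 4^k equal-area tiles, so a single partition at order k covers 1/(12 · 4^k) of the sky. A minimal sketch of that arithmetic (`sky_coverage_fraction` is an illustrative helper, not part of the hats API):

```python
def sky_coverage_fraction(partition_orders):
    """Fraction of the sky covered by HEALPix tiles at the given orders.

    At order k the sphere is split into 12 * 4**k equal-area tiles,
    so one tile covers 1 / (12 * 4**k) of the sky.
    """
    return sum(1.0 / (12 * 4**k) for k in partition_orders)


# Twelve order-0 tiles cover the whole sphere.
print(round(sky_coverage_fraction([0] * 12), 6))  # → 1.0
```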

2. Columns and data types#

HATS tables are backed by parquet files. These files store metadata about their columns, the data types, and even the range of values.

The columns and types are stored on the catalog.schema attribute as a pyarrow.Schema object. You can find more details on this object and its use in the pyarrow documentation.

Gaia has a lot of columns, so this display is long!

[3]:
catalog_object = hats.read_hats(gaia_catalog_path)
catalog_object.schema
[3]:
_healpix_29: int64
solution_id: int64
designation: string
source_id: int64
random_index: int64
ref_epoch: double
ra: double
ra_error: float
dec: double
dec_error: float
parallax: double
parallax_error: float
parallax_over_error: float
pm: float
pmra: double
pmra_error: float
pmdec: double
pmdec_error: float
ra_dec_corr: float
ra_parallax_corr: float
ra_pmra_corr: float
ra_pmdec_corr: float
dec_parallax_corr: float
dec_pmra_corr: float
dec_pmdec_corr: float
parallax_pmra_corr: float
parallax_pmdec_corr: float
pmra_pmdec_corr: float
astrometric_n_obs_al: int16
astrometric_n_obs_ac: int16
astrometric_n_good_obs_al: int16
astrometric_n_bad_obs_al: int16
astrometric_gof_al: float
astrometric_chi2_al: float
astrometric_excess_noise: float
astrometric_excess_noise_sig: float
astrometric_params_solved: int8
astrometric_primary_flag: bool
nu_eff_used_in_astrometry: float
pseudocolour: float
pseudocolour_error: float
ra_pseudocolour_corr: float
dec_pseudocolour_corr: float
parallax_pseudocolour_corr: float
pmra_pseudocolour_corr: float
pmdec_pseudocolour_corr: float
astrometric_matched_transits: int16
visibility_periods_used: int16
astrometric_sigma5d_max: float
matched_transits: int16
new_matched_transits: int16
matched_transits_removed: int16
ipd_gof_harmonic_amplitude: float
ipd_gof_harmonic_phase: float
ipd_frac_multi_peak: int8
ipd_frac_odd_win: int8
ruwe: float
scan_direction_strength_k1: float
scan_direction_strength_k2: float
scan_direction_strength_k3: float
scan_direction_strength_k4: float
scan_direction_mean_k1: float
scan_direction_mean_k2: float
scan_direction_mean_k3: float
scan_direction_mean_k4: float
duplicated_source: bool
phot_g_n_obs: int16
phot_g_mean_flux: double
phot_g_mean_flux_error: float
phot_g_mean_flux_over_error: float
phot_g_mean_mag: float
phot_bp_n_obs: int16
phot_bp_mean_flux: double
phot_bp_mean_flux_error: float
phot_bp_mean_flux_over_error: float
phot_bp_mean_mag: float
phot_rp_n_obs: int16
phot_rp_mean_flux: double
phot_rp_mean_flux_error: float
phot_rp_mean_flux_over_error: float
phot_rp_mean_mag: float
phot_bp_rp_excess_factor: float
phot_bp_n_contaminated_transits: int16
phot_bp_n_blended_transits: int16
phot_rp_n_contaminated_transits: int16
phot_rp_n_blended_transits: int16
phot_proc_mode: int8
bp_rp: float
bp_g: float
g_rp: float
radial_velocity: float
radial_velocity_error: float
rv_method_used: int8
rv_nb_transits: int16
rv_nb_deblended_transits: int16
rv_visibility_periods_used: int16
rv_expected_sig_to_noise: float
rv_renormalised_gof: float
rv_chisq_pvalue: float
rv_time_duration: float
rv_amplitude_robust: float
rv_template_teff: float
rv_template_logg: float
rv_template_fe_h: float
rv_atm_param_origin: int16
vbroad: float
vbroad_error: float
vbroad_nb_transits: int16
grvs_mag: float
grvs_mag_error: float
grvs_mag_nb_transits: int16
rvs_spec_sig_to_noise: float
phot_variable_flag: string
l: double
b: double
ecl_lon: double
ecl_lat: double
in_qso_candidates: bool
in_galaxy_candidates: bool
non_single_star: int16
has_xp_continuous: bool
has_xp_sampled: bool
has_rvs: bool
has_epoch_photometry: bool
has_epoch_rv: bool
has_mcmc_gspphot: bool
has_mcmc_msc: bool
in_andromeda_survey: bool
classprob_dsc_combmod_quasar: float
classprob_dsc_combmod_galaxy: float
classprob_dsc_combmod_star: float
teff_gspphot: float
teff_gspphot_lower: float
teff_gspphot_upper: float
logg_gspphot: float
logg_gspphot_lower: float
logg_gspphot_upper: float
mh_gspphot: float
mh_gspphot_lower: float
mh_gspphot_upper: float
distance_gspphot: float
distance_gspphot_lower: float
distance_gspphot_upper: float
azero_gspphot: float
azero_gspphot_lower: float
azero_gspphot_upper: float
ag_gspphot: float
ag_gspphot_lower: float
ag_gspphot_upper: float
ebpminrp_gspphot: float
ebpminrp_gspphot_lower: float
ebpminrp_gspphot_upper: float
libname_gspphot: string
-- schema metadata --
table_meta_yaml: 'datatype:
- name: solution_id
  datatype: int64
  descr' + 26748
table::len::designation: '1'
table::len::phot_variable_flag: '1'
table::len::libname_gspphot: '1'

2.1. Column statistics#

Parquet maintains basic statistics about the data inside its files. This includes the minimum value, maximum value, and the number of null (None, or unspecified) rows for that column.

We provide a method that aggregates these per-file statistics, yielding a global minimum and maximum for each column, along with the total sum of the null counts.

[4]:
catalog_object.aggregate_column_statistics()
[4]:
min_value max_value null_count row_count
column_names
solution_id 1636148068921376768 1636148068921376768 0 1812731847
designation Gaia DR3 1000000057322000000 Gaia DR3 999999988604363776 0 1812731847
... ... ... ... ...
ebpminrp_gspphot_upper 9.999999747378752e-05 4.226200103759766 1341620712 1812731847
libname_gspphot A PHOENIX 1341620712 1812731847

152 rows × 4 columns

Again, Gaia has a lot of columns. To make the most of this output, you can either use a pandas option to display all of the rows:

import pandas as pd
pd.set_option('display.max_rows', None)

Or restrict the columns to those you care about with a keyword argument:

[5]:
catalog_object.aggregate_column_statistics(include_columns=["ra", "dec", "ref_epoch"])
[5]:
min_value max_value null_count row_count
column_names
ref_epoch 2.016000e+03 2016.000000 0 1812731847
ra 3.409624e-07 360.000000 0 1812731847
dec -8.999288e+01 89.990052 0 1812731847
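Because the result is an ordinary pandas DataFrame, you can also filter or sort it after the fact. A sketch using a toy stand-in for the returned table:

```python
import pandas as pd

# A toy stand-in for the DataFrame returned by aggregate_column_statistics,
# with values copied from the output above.
stats = pd.DataFrame(
    {
        "min_value": [3.409624e-07, -89.99288, 9.999999747378752e-05],
        "max_value": [360.0, 89.990052, 4.226200103759766],
        "null_count": [0, 0, 1341620712],
        "row_count": [1812731847] * 3,
    },
    index=pd.Index(["ra", "dec", "ebpminrp_gspphot_upper"], name="column_names"),
)

# Keep only the columns that are never null.
print(stats[stats["null_count"] == 0].index.tolist())  # → ['ra', 'dec']
```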

About#

Authors: Melissa DeLucchi

Last updated on: Oct 27, 2025

If you use lsdb for published research, please cite it following these instructions.