join

Catalog.join(other: Catalog, left_on: str | None = None, right_on: str | None = None, through: AssociationCatalog | None = None, suffixes: tuple[str, str] | None = None, output_catalog_name: str | None = None, suffix_method: str | None = None, log_changes: bool = True) -> Catalog

Perform a spatial join to another catalog.

Joins two catalogs together on a shared column value, merging rows where they match.

This is an inner join: only rows with matching join keys are returned (unmatched rows are dropped).

The operation only joins data from matching partitions, and does not join rows that have a matching column value but are in separate partitions in the sky. For a more general join, see the merge function.
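The partition-local behavior described above can be illustrated with a toy model in plain Python. This is only a sketch of the semantics, not LSDB's implementation: the dictionaries, the partition labels such as "pixel_a", and the helper `partitioned_inner_join` are all hypothetical names introduced for illustration.

```python
from collections import defaultdict


def partitioned_inner_join(left, right, left_on, right_on):
    """Inner-join rows that share both a partition label and a key value.

    `left` and `right` map a (hypothetical) partition label to a list of
    row dictionaries. This is a toy model, not the LSDB implementation.
    """
    joined = []
    # Only partitions present in both inputs are ever compared.
    for pixel in left.keys() & right.keys():
        # Index the right-hand rows of this partition by their join key.
        by_key = defaultdict(list)
        for row in right[pixel]:
            by_key[row[right_on]].append(row)
        # Keep a left row only if a right row in the same partition matches.
        for row in left[pixel]:
            for match in by_key.get(row[left_on], []):
                joined.append({**row, **match})
    return joined


left = {
    "pixel_a": [{"id": 1, "mag": 17.2}, {"id": 2, "mag": 18.9}],
    "pixel_b": [{"id": 3, "mag": 16.4}],
}
right = {
    "pixel_a": [{"id_right": 1, "flag": True}],
    # Same key value as id 3 on the left, but in a different partition.
    "pixel_c": [{"id_right": 3, "flag": False}],
}

result = partitioned_inner_join(left, right, "id", "id_right")
```

In this sketch, `result` holds a single row for key 1. Key 2 has no match and is dropped (inner join), and key 3 is dropped even though its value appears on both sides, because the two rows sit in different partitions.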

Parameters:
other : Catalog

The right catalog to join to.

left_on : str

The name of the column in the left catalog to join on.

right_on : str

The name of the column in the right catalog to join on.

through : AssociationCatalog

An association catalog that provides the alignment between pixels and individual rows.

suffixes : tuple[str, str]

Suffixes to apply to the columns of each table.

output_catalog_name : str

The name of the resulting catalog to be stored in metadata.

suffix_method : str, default “all_columns”

Method to use to add suffixes to columns. Options are:

  • “overlapping_columns”: only add suffixes to columns that are present in both catalogs

  • “all_columns”: add suffixes to all columns from both catalogs

Warning

This default will change to “overlapping_columns” in a future release.

log_changes : bool, default True

If True, logs an info message for each column that is being renamed. This only applies when suffix_method is “overlapping_columns”.

Returns:
Catalog

A new catalog with the columns from each of the input catalogs with their respective suffixes added, and the rows merged on the specified columns.

Examples

Join two catalogs on a shared key within the same sky partitions:

>>> import lsdb
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(1000, 5, seed=0, ra_range=(0.0, 300.0), dec_range=(-50.0, 50.0))
>>> base = lsdb.from_dataframe(nf.compute()[["ra", "dec", "id"]])
>>> left = base.rename({"ra": "ra_left", "dec": "dec_left"})
>>> right = base.rename({"ra": "ra_right", "dec": "dec_right", "id": "id_right"}).map_partitions(
...     lambda df: df.assign(right_flag=True)
... )
>>> joined = left.join(right, left_on="id", right_on="id_right", suffix_method="overlapping_columns")
>>> joined.head()[
...     ["ra_left", "dec_left", "id", "right_flag"]
... ]
                      ra_left   dec_left    id  right_flag
_healpix_29
118362963675428450  52.696686  39.675892  8154        True
98504457942331510   89.913567  46.147079  3437        True
70433374600953220   40.528952  35.350965  8214        True
154968715224527848   17.57041    29.8936  9853        True
67780378363846894    45.08384   31.95611  8297        True
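
The difference between the two suffix_method options can be sketched on bare column names. The helper `suffixed_columns` and the suffixes "_a" and "_b" below are hypothetical and exist only for illustration. The sketch models the naming rule as stated in the parameter description and may not match how LSDB treats special columns such as the spatial index or the join keys.

```python
def suffixed_columns(left_cols, right_cols, suffixes, suffix_method):
    """Return the output column names of each side under a suffix method.

    Toy model of the documented rule only; not the LSDB implementation.
    """
    left_suffix, right_suffix = suffixes
    if suffix_method == "all_columns":
        # Every column from both sides receives its side's suffix.
        to_rename = set(left_cols) | set(right_cols)
    elif suffix_method == "overlapping_columns":
        # Only names that appear on both sides receive a suffix.
        to_rename = set(left_cols) & set(right_cols)
    else:
        raise ValueError(f"unknown suffix_method: {suffix_method!r}")
    new_left = [c + left_suffix if c in to_rename else c for c in left_cols]
    new_right = [c + right_suffix if c in to_rename else c for c in right_cols]
    return new_left, new_right


left_cols = ["id", "ra", "dec", "mag"]
right_cols = ["id", "ra", "dec", "flag"]
suffixes = ("_a", "_b")  # hypothetical suffixes, chosen for this sketch

all_cols = suffixed_columns(left_cols, right_cols, suffixes, "all_columns")
overlap = suffixed_columns(left_cols, right_cols, suffixes, "overlapping_columns")
```

Under "all_columns" every name changes, for example "mag" becomes "mag_a" and "flag" becomes "flag_b". Under "overlapping_columns" only the shared names "id", "ra", and "dec" are suffixed, while "mag" and "flag" keep their original names.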