from_dataframe

from_dataframe(dataframe: DataFrame, *, ra_column: str | None = None, dec_column: str | None = None, lowest_order: int = 0, highest_order: int = 7, drop_empty_siblings: bool = True, partition_rows: int | None = None, partition_bytes: int | None = None, margin_order: int = -1, margin_threshold: float | None = 5.0, should_generate_moc: bool = True, moc_max_order: int = 10, use_pyarrow_types: bool = True, schema: Schema | None = None, **kwargs) → Catalog

Load a catalog from a Pandas DataFrame.

Note that this is only suitable for small datasets (fewer than 1 million rows and under 1 GB of DataFrame in memory). If you need to deal with larger datasets, consider using the hats-import package: https://hats-import.readthedocs.io/
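To check whether a DataFrame falls under that limit, its in-memory footprint can be measured with pandas alone. A minimal sketch; the column names and sizes here are illustrative:

```python
import numpy as np
import pandas as pd

# Build a toy DataFrame of sky positions (illustrative column names).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "ra": rng.uniform(0.0, 360.0, 10_000),
        "dec": rng.uniform(-90.0, 90.0, 10_000),
        "id": np.arange(10_000),
    }
)

# Total in-memory size in bytes; keep this well under ~1 GB
# before handing the DataFrame to from_dataframe.
size_bytes = int(df.memory_usage(deep=True).sum())
print(size_bytes)
```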

Parameters:
dataframe : pd.DataFrame

The catalog Pandas DataFrame.

ra_column : str, optional

The name of the right ascension column. By default, case-insensitive versions of ‘ra’ are detected.

dec_column : str, optional

The name of the declination column. By default, case-insensitive versions of ‘dec’ are detected.

lowest_order : int, default 0

The lowest partition order.

highest_order : int, default 7

The highest partition order.

drop_empty_siblings : bool, default True

When determining the final partitioning, if 3 of 4 sibling pixels are empty, keep only the non-empty pixel.

partition_rows : int or None, default None

The desired partition size, in number of rows. Only one of partition_rows or partition_bytes should be specified.

Note: partitioning is spatial (HEALPix-based). partition_rows is a best-effort target, and the resulting number of partitions is limited by highest_order and the sky footprint of your data (e.g., if all rows fall into a single HEALPix pixel at highest_order, you will still get a single partition).

partition_bytes : int or None, default None

The desired partition size, in bytes. Only one of partition_rows or partition_bytes should be specified.

Note: as with partition_rows, this is a best-effort target for spatial (HEALPix-based) partitioning and is limited by highest_order.

margin_order : int, default -1

The order at which to generate the margin cache.

margin_threshold : float or None, default 5.0

The size of the margin cache boundary, in arcseconds. If None and margin_order is not specified, the margin cache is not generated.

should_generate_moc : bool, default True

Whether to generate a MOC (multi-order coverage map) of the data. This can improve performance when joining or crossmatching with other hats-sharded datasets.

moc_max_order : int, default 10

If generating a MOC, the maximum order to use.

use_pyarrow_types : bool, default True

If True, the data is backed by pyarrow; otherwise, the original data types are kept.

schema : pa.Schema or None, default None

The arrow schema to create the catalog with. If None, the schema is automatically inferred from the provided DataFrame using pa.Schema.from_pandas.

**kwargs

Arguments to pass to the creation of the catalog info.

Returns:
Catalog

Catalog object loaded from the given parameters.

Raises:
ValueError

If RA/Dec columns are not found or contain NaN values.

Examples

Create a small, synthetic sky catalog and load it into LSDB:

>>> import lsdb
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(1000, 5, seed=0, ra_range=(0.0, 300.0), dec_range=(-50.0, 50.0))
>>> df = nf.compute()[["ra", "dec", "id"]]
>>> catalog = lsdb.from_dataframe(df, catalog_name="toy_catalog")
>>> catalog.head()
                        ra        dec    id
_healpix_29
118362963675428450  52.696686  39.675892  8154
98504457942331510   89.913567  46.147079  3437
70433374600953220   40.528952  35.350965  8214
154968715224527848   17.57041    29.8936  9853
67780378363846894    45.08384   31.95611  8297