Understanding the NestedFrame

In this tutorial, we will:

  • explore the general capabilities of the DataFrame-like API that LSDB is built on, the NestedFrame

  • package data into nested columns using common patterns

  • work with nested columns using standard access and transformation approaches

Introduction

The LSDB catalog is built on top of the pandas/dask ecosystem, so much of the look and feel of a pandas DataFrame carries over to catalog operations. Pandas DataFrames are ubiquitous, both in astronomy-specific applications and in the broader scientific Python community. If you haven’t encountered them before, you can read about them in the pandas documentation.

However, for complex and interconnected astronomical datasets, the pandas DataFrame has several limitations. The nested-pandas package was developed with these limitations in mind and provides a performant extension class, the NestedFrame. The main idea of the NestedFrame is to nest DataFrames within DataFrames. For example, in time-domain astronomy, the photometry table corresponding to an object table can be nested directly within a column of the object table.

The core advantages are:

  • No Duplication of Data: In native pandas one might be tempted to simply join the tables, but this repeats every object table column once per photometry row. With many object table columns, this can severely increase the memory footprint of the full dataset.

  • Avoiding Grouping Operations: Nested Datasets are pre-grouped by the top-level index, avoiding grouping operations that come with storing these datasets as flat tables (for example, when doing operations that apply to a lightcurve).

  • Interacting with Nested Datasets as Columns: Because the DataFrames are directly nested, DataFrame operations apply to the nested data intuitively. For example, filtering on object IDs also removes the nested photometry of the rejected objects, just like any other data in those rows.

  • Pyarrow-backed storage and operation: DataFrames in DataFrames is the core idea, but not the actual implementation. Instead, data is stored internally in pyarrow data structures, allowing performant storage and operation on nested datasets. You can read more about pyarrow in its documentation.
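To make the first two advantages concrete, here is a minimal plain-pandas sketch (not nested-pandas) of the flat-table alternative. The table and column names (`objects`, `photometry`, `object_id`) are hypothetical toy data and are not part of LSDB.

```python
import pandas as pd

# Hypothetical toy tables: two objects with three photometry rows each
objects = pd.DataFrame(
    {"ra": [10.0, 20.0], "dec": [-5.0, 5.0]},
    index=pd.Index([0, 1], name="object_id"),
)
photometry = pd.DataFrame(
    {"t": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5], "flux": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]},
    index=pd.Index([0, 0, 0, 1, 1, 1], name="object_id"),
)

# A flat join repeats every object column once per photometry row
flat = objects.join(photometry)
print(flat)

# Per-object statistics on the flat table need an explicit grouping step
mean_flux = flat.groupby(level=0)["flux"].mean()
print(mean_flux)
```

In the flat result, `ra` and `dec` appear three times per object, and the per-object mean requires a `groupby`. These are the two costs that the nested representation is meant to avoid.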

NOTE: In this tutorial, the terms “NestedFrame” and “DataFrame” will at times be used interchangeably. The objects themselves are always NestedFrames, but “DataFrame” is often a convenient term for a single container of data.

1. The Nested Data Representation

We can begin by walking through the basics of the NestedFrame representation. Below, we generate a toy NestedFrame with 5 base columns (“ra”, “dec”, “id”, “a”, “b”) and 1 nested column (“nested”), which we rename to “lightcurve”. In a notebook, the nested column is represented as a small inner DataFrame per row, with information on the available sub-columns (sometimes referred to as fields), a preview of the first row, and the number of additional rows.

[1]:
from lsdb import generate_data

# Generate toy data
nf = generate_data(5, 10, seed=1).compute()
# Do some reformatting for a nice output
nf = nf[["id", "ra", "dec", "a", "b", "nested"]].rename({"nested": "lightcurve"}, axis=1)
nf = nf.sort_values("id")
nf
[1]:
  id ra dec a b lightcurve
1 2 342.166931 40.950381 0.720324 0.372520
t flux band flux_err
13.70439 41.405599 g 2.07028
+9 rows ... ... ...
4 22 112.259323 -70.888248 0.146756 1.077633
t flux band flux_err
0.547752 4.995346 r 0.249767
+9 rows ... ... ...
0 24 184.255785 -8.820946 0.417022 0.184677
t flux band flux_err
8.38389 10.233443 g 0.511672
+9 rows ... ... ...
2 36 51.897461 -10.463070 0.000114 0.691121
t flux band flux_err
4.089045 69.440016 g 3.472001
+9 rows ... ... ...
3 46 341.513801 5.692378 0.302333 0.793535
t flux band flux_err
17.562349 41.417927 g 2.070896
+9 rows ... ... ...
5 rows x 6 columns

These sub-DataFrames act as any other value stored in a DataFrame and can be accessed through normal pandas indexing:

[2]:
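nf["lightcurve"][0]  # look at the "lightcurve" dataframe for the row with index label 0
# or
# nf.loc[0]["lightcurve"]
# Note: after sorting by "id", label 0 is not the first row; nf.iloc[0] selects the row labeled 1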
[2]:
t flux band flux_err
0 8.383890 10.233443 g 0.511672
1 13.409350 53.589641 g 2.679482
2 16.014891 90.340192 g 4.517010
3 17.892133 16.535420 g 0.826771
4 1.966937 88.330609 g 4.416530
5 6.310313 89.588622 r 4.479431
6 19.777222 11.474597 g 0.573730
7 8.957871 23.702698 g 1.185135
8 0.387339 32.664490 g 1.633225
9 1.067251 62.336012 r 3.116801

When represented lazily, nested columns are identified by their custom dtype, which also provides dtype information about the sub-columns:

[3]:
generate_data(5, 10).rename(columns={"nested": "lightcurve"})
[3]:
LSDB.nested NestedFrame Structure:
ra dec id a b lightcurve
npartitions=1
0 float64 float64 int64 float64 float64 nested<t: [double], flux: [double], band: [string], flux_err: [double]>
4 ... ... ... ... ... ...
Dask Name: operation, 4 expressions

Beyond just visual identification, there are programmatic ways to determine nested columns:

[4]:
nf.nested_columns
[4]:
['lightcurve']

2. What We Can Do With Nested Columns

NestedFrames are pandas DataFrames, but expanded to work directly with nested columns. This is done in a few ways:

  • Introduces some new access dynamics for nested columns.

  • Adds several new functions specific to working with nested columns.

  • Modifies some existing DataFrame functions to enable support for nested columns.

Sub-Column Access

The first thing to touch on is the access dynamics, starting with nested sub-column access:

[5]:
# flux is a sub-column of the "lightcurve" column
nf["lightcurve.flux"]
[5]:
1    41.405599
1    66.379465
       ...
3    35.726976
3    69.089692
Name: flux, Length: 50, dtype: double[pyarrow]

Using “[column].[sub-column]” allows access to the sub-column data. In the case above, it pulls out a flat pandas Series of all flux values from all nested DataFrames, indexed by the row each value belongs to. This sub-column access pattern is also available in many pandas/LSDB functions and augments their behavior. For example, query:

[6]:
nf.query("lightcurve.band == 'g'")
[6]:
  id ra dec a b lightcurve
1 2 342.166931 40.950381 0.720324 0.372520
t flux band flux_err
13.70439 41.405599 g 2.07028
+4 rows ... ... ...
4 22 112.259323 -70.888248 0.146756 1.077633
t flux band flux_err
17.56285 72.599799 g 3.62999
+5 rows ... ... ...
0 24 184.255785 -8.820946 0.417022 0.184677
t flux band flux_err
8.38389 10.233443 g 0.511672
+7 rows ... ... ...
2 36 51.897461 -10.463070 0.000114 0.691121
t flux band flux_err
4.089045 69.440016 g 3.472001
+4 rows ... ... ...
3 46 341.513801 5.692378 0.302333 0.793535
t flux band flux_err
17.562349 41.417927 g 2.070896
+5 rows ... ... ...
5 rows x 6 columns

By using “lightcurve.band” in the query string, we signal that the query applies to the nested DataFrames, in this case filtering them by the “band” sub-column. Some operations may leave some of the sub-DataFrames empty, as happens below when we apply a harsh filter on flux:

[7]:
nf_highflux = nf.query("lightcurve.flux > 95.")
nf_highflux
[7]:
  id ra dec a b lightcurve
1 2 342.166931 40.950381 0.720324 0.372520 None
4 22 112.259323 -70.888248 0.146756 1.077633
t flux band flux_err
13.995167 99.732285 g 4.986614
+0 rows ... ... ...
0 24 184.255785 -8.820946 0.417022 0.184677 None
2 36 51.897461 -10.463070 0.000114 0.691121
t flux band flux_err
16.692513 96.484005 g 4.8242
+0 rows ... ... ...
3 46 341.513801 5.692378 0.302333 0.793535 None
5 rows x 6 columns

Empty DataFrames are represented as None values, and if we wanted to reject rows that had empty DataFrames, we could do something like this:

[8]:
nf_highflux.dropna(subset="lightcurve")
[8]:
  id ra dec a b lightcurve
4 22 112.259323 -70.888248 0.146756 1.077633
t flux band flux_err
13.995167 99.732285 g 4.986614
+0 rows ... ... ...
2 36 51.897461 -10.463070 0.000114 0.691121
t flux band flux_err
16.692513 96.484005 g 4.8242
+0 rows ... ... ...
2 rows x 6 columns

For the most part, the pandas functions that nested-pandas modifies are changed so that sub-column access applies the function to the inner DataFrames. Not every function has been modified; you can refer to this page to see the list of augmented functions.

Using Nested Columns in Analysis

Another important feature of the NestedFrame API is map_rows, which allows columns and sub-columns to be used in external analysis functions:

[ ]:
import numpy as np

# Calculate the mean of the "flux" sub-column per row of the NestedFrame
mean_flux = nf.map_rows(
    np.mean,  # Specify the function to apply
    columns=["lightcurve.flux"],  # Specify the column(s) to pass to the function
    row_container="args",  # Pass columns as individual arguments
    output_names=["mean_flux"],  # Name the output column
)
mean_flux
mean_flux
1 55.903454
4 55.829467
0 47.879572
2 62.510849
3 55.587439

5 rows × 1 columns
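As a mental model only, the call above behaves as if the function were invoked once per row with that row's flux values. The sketch below mimics this with plain numpy and pandas; `flux_by_row` and `apply_per_row` are hypothetical names invented for illustration, and this is not how nested-pandas implements map_rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a nested "flux" sub-column: one array per row label
flux_by_row = {
    1: np.array([1.0, 2.0, 3.0]),
    4: np.array([10.0, 20.0]),
    0: np.array([5.0]),
}


def apply_per_row(func, arrays_by_row, output_name):
    """Call func once per row and collect the results in a one-column frame."""
    results = {label: func(values) for label, values in arrays_by_row.items()}
    return pd.DataFrame({output_name: pd.Series(results)})


mean_flux_sketch = apply_per_row(np.mean, flux_by_row, "mean_flux")
print(mean_flux_sketch)
```

The result keeps one value per row label, so it has the same shape as the object-level table and can be joined back onto it.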

A common use case is to use map_rows to calculate object-level metrics, as done above, which can then be added back to the NestedFrame:

[11]:
nf.join(mean_flux)
[11]:
  id ra dec a b lightcurve mean_flux
1 2 342.166931 40.950381 0.720324 0.372520
t flux band flux_err
13.70439 41.405599 g 2.07028
+9 rows ... ... ...
55.903454
4 22 112.259323 -70.888248 0.146756 1.077633
t flux band flux_err
0.547752 4.995346 r 0.249767
+9 rows ... ... ...
55.829467
0 24 184.255785 -8.820946 0.417022 0.184677
t flux band flux_err
8.38389 10.233443 g 0.511672
+9 rows ... ... ...
47.879572
2 36 51.897461 -10.463070 0.000114 0.691121
t flux band flux_err
4.089045 69.440016 g 3.472001
+9 rows ... ... ...
62.510849
3 46 341.513801 5.692378 0.302333 0.793535
t flux band flux_err
17.562349 41.417927 g 2.070896
+9 rows ... ... ...
55.587439
5 rows x 7 columns

We use sub-column access to signal to map_rows that we’re applying the mean to the “flux” sub-column of the “lightcurve” column. Let’s look at a more complex example:

[13]:
# This time our function takes the whole row as input
def flux_ranges(row, single_band=False):
    # Extract the relevant columns from the row dictionary
    flux = row["lightcurve.flux"]
    band = row["lightcurve.band"]
    modifier = row["a"]  # use the "a" column as a modifier

    # Apply the modifier as a flat offset to the flux values
    flux = flux + modifier

    # Optionally filter to just a single band
    if single_band:
        mask = band == single_band
        flux = flux[mask]

    # Returning a dictionary will apply the keys as column names
    # Alternatively, supply output_names argument to map_rows
    return {"min_flux": flux.min(), "max_flux": flux.max()}


# single_band is not a column of the NestedFrame, so we pass it as a keyword argument
nf.map_rows(flux_ranges, single_band="g")
[13]:
min_flux max_flux
1 2.302449 95.669250
4 27.139545 99.879041
0 10.650465 90.757214
2 13.927749 96.484119
3 41.720260 94.761808

5 rows × 2 columns

flux_ranges is still a relatively simple function, but it exemplifies some of the important aspects of map_rows:

  1. We see that we were able to pass both nested sub-columns (“lightcurve.flux”, “lightcurve.band”) as well as a base column (“a”). In this case, “a” is used to add a flat offset to the flux array.

  2. The function returns a dictionary, with string keys matched to the resulting data. This is nice to do, as it allows column names to be generated for the output.

  3. Our flux_ranges function has a kwarg that is not sourced from our NestedFrame. In this case, we passed it at the end of the map_rows call using kwarg syntax (single_band='g'). Any arguments (not just kwargs) that are not sourced from the NestedFrame must be passed in this way.
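Because a function like this only reads values out of a row by name, it can be tried out on a hand-built row before being handed to map_rows. The sketch below assumes the row behaves like a plain dictionary that maps column names to scalars and sub-column names to arrays. The row contents are made up, and the function is a lightly adapted copy (it uses `None` as the default and converts the results to Python floats).

```python
import numpy as np


def flux_ranges(row, single_band=None):
    # Offset the flux values by the base column "a"
    flux = np.asarray(row["lightcurve.flux"]) + row["a"]
    band = np.asarray(row["lightcurve.band"])

    # Optionally keep only the measurements taken in one band
    if single_band is not None:
        flux = flux[band == single_band]

    # The dictionary keys become the output column names
    return {"min_flux": float(flux.min()), "max_flux": float(flux.max())}


# Hypothetical hand-built row standing in for one row of the NestedFrame
row = {
    "a": 0.5,
    "lightcurve.flux": np.array([10.0, 20.0, 30.0, 40.0]),
    "lightcurve.band": np.array(["g", "r", "g", "r"]),
}

result = flux_ranges(row, single_band="g")
print(result)  # {'min_flux': 10.5, 'max_flux': 30.5}
```

Testing on a small dictionary first makes it easier to catch indexing or masking mistakes before running the function over every row.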

The nest Accessor

Another NestedFrame aspect to touch on is the .nest accessor:

[14]:
nf.lightcurve.nest
[14]:
<nested_pandas.series.accessor.NestSeriesAccessor at 0x7e5bd37a65a0>

This accessor object can be called on any nested column and has a small suite of additional functions used for viewing and transforming nested data. For example, we can programmatically view the available sub-columns through the accessor:

[ ]:
nf.lightcurve.nest.columns
['t', 'flux', 'band', 'flux_err']

We can use the accessor to grab different views of our nested data:

[16]:
# View a flat version of the nested data
# All sub-dataframes are brought to one level
nf.lightcurve.nest.to_flat()
[16]:
t flux band flux_err
1 13.70439 41.405599 g 2.07028
1 8.346096 66.379465 r 3.318973
... ... ... ... ...
3 5.310933 35.726976 r 1.786349
3 11.786111 69.089692 g 3.454485

50 rows × 4 columns

[17]:
# Or view the data as lists
nf.lightcurve.nest.to_lists()
[17]:
t flux band flux_err
1 [13.70439001 8.34609605 19.36523151 1.700884... [41.40559878 66.37946452 13.74747041 92.750858... ['g' 'r' 'r' 'g' 'r' 'g' 'g' 'r' 'r' 'g'] [2.07027994 3.31897323 0.68737352 4.6375429 3...
4 [ 0.54775186 3.96202978 17.52778305 17.562850... [ 4.99534589 58.65550405 39.7676837 72.599798... ['r' 'r' 'r' 'g' 'g' 'g' 'g' 'g' 'r' 'g'] [0.24976729 2.9327752 1.98838418 3.62998993 1...
0 [ 8.38389029 13.4093502 16.01489137 17.892133... [10.23344288 53.58964059 90.34019153 16.535419... ['g' 'g' 'g' 'g' 'g' 'r' 'g' 'g' 'g' 'r'] [0.51167214 2.67948203 4.51700958 0.82677099 4...
2 [ 4.08904499 11.17379657 6.26848356 0.781095... [69.44001577 51.48891121 13.92763473 34.776585... ['g' 'r' 'g' 'r' 'g' 'g' 'r' 'r' 'r' 'g'] [3.47200079 2.57444556 0.69638174 1.7388293 3...
3 [17.56234873 2.80773877 13.84645231 3.396608... [41.41792695 94.4594756 80.73912887 75.081210... ['g' 'g' 'g' 'g' 'r' 'g' 'r' 'r' 'r' 'g'] [2.07089635 4.72297378 4.03695644 3.75406052 1...
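The flat and list views shown above hold the same values in two layouts. As an analogy in plain pandas (not the nested-pandas implementation), the sketch below converts a hypothetical list-valued table to a flat one and back, using `DataFrame.explode` and a `groupby` aggregation.

```python
import pandas as pd

# Hypothetical list-valued table: one list per row and sub-column
lists = pd.DataFrame(
    {
        "t": [[1.0, 2.0], [3.0, 4.0, 5.0]],
        "flux": [[10.0, 20.0], [30.0, 40.0, 50.0]],
    },
    index=[0, 1],
)

# Flat view: one row per measurement, with the parent index repeated
# (exploding several columns at once requires pandas >= 1.3)
flat = lists.explode(["t", "flux"])
print(flat)

# Back to lists: collect the measurements that share a parent index
round_trip = flat.groupby(level=0).agg(list)
print(round_trip)
```

The repeated index in the flat view is what ties each measurement back to its parent row.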

3. Turning Data into Nested Columns

In many data products, the relevant data will already be packaged as nested columns. However, it’s very possible your data will be provided in other formats that you would like to convert into nested columns. Let’s walk through a few common examples.

Converting from List Columns

[18]:
import nested_pandas as npd

nf_lists = npd.NestedFrame(
    {
        "a": [10, 20, 30],
        "b": ["red", "blue", "green"],
        "c": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "d": [[11, 22, 33], [44, 55, 66], [77, 88, 99]],
    }
)
nf_lists
[18]:
a b c d
0 10 red [1, 2, 3] [11, 22, 33]
1 20 blue [4, 5, 6] [44, 55, 66]
2 30 green [7, 8, 9] [77, 88, 99]

3 rows × 4 columns

In this case, “c” and “d” are list columns. Because the lengths of “c” and “d” match for each row, they can be nested together. We can achieve this with the nest_lists function:

[19]:
nf_lists.nest_lists(name="packed", columns=["c", "d"])
[19]:
  a b packed
0 10 red
c d
1 11
+2 rows ...
1 20 blue
c d
4 44
+2 rows ...
2 30 green
c d
7 77
+2 rows ...
3 rows x 3 columns
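Since list columns can only be nested together when their lengths match in every row, it can help to check this before nesting. The following is a minimal plain-pandas sketch of such a check. The frame is hypothetical and deliberately contains one mismatched row.

```python
import pandas as pd

# Hypothetical list columns; the second row has mismatched lengths
df = pd.DataFrame(
    {
        "c": [[1, 2, 3], [4, 5, 6]],
        "d": [[11, 22, 33], [44, 55]],
    }
)

# Compare the per-row list lengths of the two columns
lengths_match = df["c"].map(len) == df["d"].map(len)
print(lengths_match)

# Rows that would need fixing before the columns can be packed together
mismatched_rows = df[~lengths_match]
print(mismatched_rows)
```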

Converting from Flat Tables

Another case may be that your data is available as two separate tables:

[20]:
import nested_pandas as npd

nf_base = npd.NestedFrame({"a": [10, 20, 30], "b": ["red", "blue", "green"]})
nf_flat = npd.NestedFrame(
    {"c": [1, 2, 3, 4, 5, 6, 7, 8, 9], "d": [11, 22, 33, 44, 55, 66, 77, 88, 99]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)
nf_base
[20]:
a b
0 10 red
1 20 blue
2 30 green

3 rows × 2 columns

[21]:
nf_flat
[21]:
c d
0 1 11
0 2 22
0 3 33
1 4 44
1 5 55
1 6 66
2 7 77
2 8 88
2 9 99

9 rows × 2 columns

In this case, nf_flat has a matched index with nf_base, and we can easily nest it using join_nested:

[ ]:
nf_nested = nf_base.join_nested(nf_flat, name="packed")
nf_nested
  a b packed
0 10 red
c d
1 11
+2 rows ...
1 20 blue
c d
4 44
+2 rows ...
2 30 green
c d
7 77
+2 rows ...
3 rows x 3 columns

If the index is not matched, you can set the “on” kwarg to a shared column between the two DataFrames to treat the operation more as a join.
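A related preparation step, sketched here in plain pandas, is to set the shared column as the index of both tables so that they become index-matched like the tables in the example above. The column name `obj` and the table contents are hypothetical. The sketch also flags flat rows that have no counterpart in the base table.

```python
import pandas as pd

# Hypothetical tables that share the key column "obj"
base = pd.DataFrame({"obj": ["x", "y"], "a": [10, 20]})
flat = pd.DataFrame({"obj": ["x", "x", "y", "z"], "c": [1, 2, 3, 4]})

# Use the shared column as the index of both tables
base_indexed = base.set_index("obj")
flat_indexed = flat.set_index("obj")

# Flag flat rows whose key has no counterpart in the base table
has_match = flat_indexed.index.isin(base_indexed.index)
orphans = flat_indexed[~has_match]
flat_matched = flat_indexed[has_match]

print(orphans)
print(flat_matched)
```

Inspecting `orphans` first makes it explicit which flat rows would have no parent row to be nested under.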

This notebook touches on a few of the most common concepts of the NestedFrame. While this notebook works directly with the NestedFrame, operations on the LSDB catalog largely work the same. However, minor differences do exist, as certain functions may not be available in the LSDB catalog API, and arguments may differ depending on LSDB’s ability to support a given operation. You’re encouraged to consult the docs for a specific function whenever you aren’t sure about its behavior. For more information on NestedFrames, and nested-pandas in general, consult the nested-pandas documentation.

About

Authors: Doug Branton

Last updated/verified on: October 27, 2025

If you use lsdb for published research, please cite it by following the citation instructions.