Understanding the NestedFrame#
In this tutorial, we will:
explore the general capabilities of the DataFrame-like API that LSDB is built on, the NestedFrame
package data into nested columns using common patterns
work with nested columns using standard access and transformation approaches
Introduction#
The LSDB catalog is built on top of the pandas/dask ecosystem, where much of the look and feel of a pandas DataFrame is present in catalog operations. Pandas DataFrames are fairly ubiquitous, both in astronomy-specific applications and the broader scientific Python community. If you haven’t encountered them before, you can read about them in the pandas documentation.
However, in complex and interconnected astronomical datasets, the pandas DataFrame has several limitations. The nested-pandas package was developed with these specific limitations in mind, and provides a performant extension class, the NestedFrame. The main idea of the NestedFrame is to provide the ability to nest DataFrames within DataFrames. An example comes from time-domain astronomy, where the photometry table corresponding to an object table can be nested directly within a column of the object table.
The core advantages are:
No Duplication of Data: In native pandas one might be tempted to simply join the tables, but this will duplicate all columns from the object table. With many object table columns, this can severely increase the memory footprint of the full dataset.
Avoiding Grouping Operations: Nested Datasets are pre-grouped by the top-level index, avoiding grouping operations that come with storing these datasets as flat tables (for example, when doing operations that apply to a lightcurve).
Interacting with Nested Datasets as Columns: Because the DataFrames are directly nested, DataFrame operations apply to the nested data intuitively. For example, filtering based on object IDs will also remove any nested photometry for the discarded object IDs, just like any other data in the row.
Pyarrow-backed storage and operation: DataFrames in DataFrames is the core idea, but not the actual implementation. Instead, data is stored internally in pyarrow data structures, allowing performant storage and operation on nested datasets. You can read more about pyarrow in its documentation.
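To make the first two advantages concrete, here is a minimal pandas-only sketch of the flat-table alternative. The table names and values (`objects`, `photometry`) are hypothetical and chosen only for illustration; the point is that a flat join repeats the object columns on every photometry row and that per-object metrics then need a groupby.

```python
import pandas as pd

# Hypothetical toy tables (names and values are illustrative only)
objects = pd.DataFrame(
    {"ra": [10.0, 20.0, 30.0], "dec": [-5.0, 0.0, 5.0]},
    index=pd.Index([0, 1, 2], name="object_id"),
)
photometry = pd.DataFrame(
    {
        "object_id": [0, 0, 0, 1, 1, 2],
        "t": [1.0, 2.0, 3.0, 1.0, 2.0, 1.0],
        "flux": [5.0, 6.0, 7.0, 1.0, 2.0, 9.0],
    }
)

# Flat join: every photometry row carries a copy of its object's columns
flat = photometry.merge(objects, left_on="object_id", right_index=True)

# A per-object metric on the flat table requires an explicit groupby
mean_flux = flat.groupby("object_id")["flux"].mean()
```

In this toy case `flat` has six rows but only three distinct `ra` values, so each object-level value is stored once per observation rather than once per object.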
NOTE: In this tutorial, the terms “NestedFrame” and “DataFrame” will at times be used interchangeably. In general, these objects will always be NestedFrames but “DataFrame” is often a nice term to describe a single container of data.
1. The Nested Data Representation#
We can begin by walking through the basics of the NestedFrame representation. Below, we generate a toy NestedFrame which has 5 base columns (“ra”, “dec”, “id”, “a”, “b”) and 1 nested column (“nested”, renamed to “lightcurve” below). In a notebook, the nested column is represented as a small inner DataFrame per row, with information on the available sub-columns (sometimes referred to as fields), a preview of the first row, and the number of additional rows.
[1]:
from lsdb import generate_data
# Generate toy data
nf = generate_data(5, 10, seed=1).compute()
# Do some reformatting for a nice output
nf = nf[["id", "ra", "dec", "a", "b", "nested"]].rename({"nested": "lightcurve"}, axis=1)
nf = nf.sort_values("id")
nf
[1]:
| | id | ra | dec | a | b | lightcurve |
|---|---|---|---|---|---|---|
| 1 | 2 | 342.166931 | 40.950381 | 0.720324 | 0.372520 | (nested DataFrame) |
| 4 | 22 | 112.259323 | -70.888248 | 0.146756 | 1.077633 | (nested DataFrame) |
| 0 | 24 | 184.255785 | -8.820946 | 0.417022 | 0.184677 | (nested DataFrame) |
| 2 | 36 | 51.897461 | -10.463070 | 0.000114 | 0.691121 | (nested DataFrame) |
| 3 | 46 | 341.513801 | 5.692378 | 0.302333 | 0.793535 | (nested DataFrame) |
These sub-DataFrames act as any other value stored in a DataFrame and can be accessed through normal pandas indexing:
[2]:
nf["lightcurve"][0] # look at the "lightcurve" DataFrame for the row with index label 0
# or
# nf.loc[0]["lightcurve"]
[2]:
| | t | flux | band | flux_err |
|---|---|---|---|---|
| 0 | 8.383890 | 10.233443 | g | 0.511672 |
| 1 | 13.409350 | 53.589641 | g | 2.679482 |
| 2 | 16.014891 | 90.340192 | g | 4.517010 |
| 3 | 17.892133 | 16.535420 | g | 0.826771 |
| 4 | 1.966937 | 88.330609 | g | 4.416530 |
| 5 | 6.310313 | 89.588622 | r | 4.479431 |
| 6 | 19.777222 | 11.474597 | g | 0.573730 |
| 7 | 8.957871 | 23.702698 | g | 1.185135 |
| 8 | 0.387339 | 32.664490 | g | 1.633225 |
| 9 | 1.067251 | 62.336012 | r | 3.116801 |
When represented lazily, nested columns are identified by their custom dtype, which also provides dtype information about the sub-columns:
[3]:
generate_data(5, 10).rename(columns={"nested": "lightcurve"})
[3]:
| | ra | dec | id | a | b | lightcurve |
|---|---|---|---|---|---|---|
| npartitions=1 | | | | | | |
| 0 | float64 | float64 | int64 | float64 | float64 | nested<t: [double], flux: [double], band: [string], flux_err: [double]> |
| 4 | ... | ... | ... | ... | ... | ... |
Beyond just visual identification, there are programmatic ways to determine nested columns:
[4]:
nf.nested_columns
[4]:
['lightcurve']
2. What We Can Do With Nested Columns#
NestedFrames are pandas DataFrames, but expanded to work directly with nested columns. This is done in a few ways:
Introduces some new access dynamics for nested columns.
Adds several new functions specific to working with nested columns.
Modifies some existing DataFrame functions to enable support for nested columns.
Sub-Column Access#
The first to touch on are the access dynamics, through nested sub-column access:
[5]:
# flux is a sub-column of the "lightcurve" column
nf["lightcurve.flux"]
[5]:
1 41.405599
1 66.379465
...
3 35.726976
3 69.089692
Name: flux, Length: 50, dtype: double[pyarrow]
Using “[column].[sub-column]” allows access to the sub-column data. In the case above, it pulls out a flat pandas Series of all flux values from all nested DataFrames, indexed by the row each value belongs to. This sub-column access pattern is also available in many pandas/LSDB functions, where it augments their behavior. For example, query:
[6]:
nf.query("lightcurve.band == 'g'")
[6]:
| | id | ra | dec | a | b | lightcurve |
|---|---|---|---|---|---|---|
| 1 | 2 | 342.166931 | 40.950381 | 0.720324 | 0.372520 | (nested DataFrame) |
| 4 | 22 | 112.259323 | -70.888248 | 0.146756 | 1.077633 | (nested DataFrame) |
| 0 | 24 | 184.255785 | -8.820946 | 0.417022 | 0.184677 | (nested DataFrame) |
| 2 | 36 | 51.897461 | -10.463070 | 0.000114 | 0.691121 | (nested DataFrame) |
| 3 | 46 | 341.513801 | 5.692378 | 0.302333 | 0.793535 | (nested DataFrame) |
By using “lightcurve.band” in the query string, we signal that the query should be applied to the nested DataFrames, in this case filtering them by the “band” sub-column. Some operations may leave some of the sub-DataFrames empty, as in the example below, where we apply a harsh filter on flux:
[7]:
nf_highflux = nf.query("lightcurve.flux > 95.")
nf_highflux
[7]:
| | id | ra | dec | a | b | lightcurve |
|---|---|---|---|---|---|---|
| 1 | 2 | 342.166931 | 40.950381 | 0.720324 | 0.372520 | None |
| 4 | 22 | 112.259323 | -70.888248 | 0.146756 | 1.077633 | (nested DataFrame) |
| 0 | 24 | 184.255785 | -8.820946 | 0.417022 | 0.184677 | None |
| 2 | 36 | 51.897461 | -10.463070 | 0.000114 | 0.691121 | (nested DataFrame) |
| 3 | 46 | 341.513801 | 5.692378 | 0.302333 | 0.793535 | None |
Empty DataFrames are represented as None values, and if we wanted to reject rows that had empty DataFrames, we could do something like this:
[8]:
nf_highflux.dropna(subset="lightcurve")
[8]:
| | id | ra | dec | a | b | lightcurve |
|---|---|---|---|---|---|---|
| 4 | 22 | 112.259323 | -70.888248 | 0.146756 | 1.077633 | (nested DataFrame) |
| 2 | 36 | 51.897461 | -10.463070 | 0.000114 | 0.691121 | (nested DataFrame) |
For the most part, pandas functions changed in nested-pandas are modified to allow sub-column access to apply that function to the inner dataframes. Not every function has been modified, and you can refer to this page to see the list of augmented functions.
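For comparison, the following pandas-only sketch shows roughly what the nested query and the subsequent dropna correspond to when the same data is kept as two flat tables. It is an analogue under assumed toy data (`base`, `flat`), not a description of how nested-pandas implements these operations.

```python
import pandas as pd

# Hypothetical toy tables: two objects and their flat photometry
base = pd.DataFrame({"a": [1.0, 2.0]}, index=[0, 1])
flat = pd.DataFrame(
    {"flux": [1.0, 2.0, 3.0, 4.0, 10.0], "band": ["g", "r", "g", "g", "r"]},
    index=[0, 0, 0, 1, 1],
)

# Analogue of a query on a sub-column: filter the photometry rows only
bright = flat.query("flux > 5")

# Count surviving photometry rows per object; object 0 is left with none
n_left = bright.groupby(level=0).size().reindex(base.index, fill_value=0)

# Analogue of dropping rows with an empty nested column
kept = base[n_left > 0]
```

With a NestedFrame this bookkeeping is handled by the query itself, since the photometry travels with its object row.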
Using Nested Columns in Analysis#
Another important feature of the NestedFrame API is map_rows, which allows columns and sub-columns to be used in external analysis functions:
[ ]:
import numpy as np
# Calculate the mean of the "flux" sub-column per row of the NestedFrame
mean_flux = nf.map_rows(
    np.mean,  # Specify the function to apply
    columns=["lightcurve.flux"],  # Specify the column(s) to pass to the function
    row_container="args",  # Pass columns as individual arguments
    output_names=["mean_flux"],  # Rename the output column
)
mean_flux
| | mean_flux |
|---|---|
| 1 | 55.903454 |
| 4 | 55.829467 |
| 0 | 47.879572 |
| 2 | 62.510849 |
| 3 | 55.587439 |
5 rows × 1 columns
A common use case is to use map_rows to calculate object-level metrics, as done above, which can then be added back to the NestedFrame:
[11]:
nf.join(mean_flux)
[11]:
| | id | ra | dec | a | b | lightcurve | mean_flux |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 342.166931 | 40.950381 | 0.720324 | 0.372520 | (nested DataFrame) | 55.903454 |
| 4 | 22 | 112.259323 | -70.888248 | 0.146756 | 1.077633 | (nested DataFrame) | 55.829467 |
| 0 | 24 | 184.255785 | -8.820946 | 0.417022 | 0.184677 | (nested DataFrame) | 47.879572 |
| 2 | 36 | 51.897461 | -10.463070 | 0.000114 | 0.691121 | (nested DataFrame) | 62.510849 |
| 3 | 46 | 341.513801 | 5.692378 | 0.302333 | 0.793535 | (nested DataFrame) | 55.587439 |
We use sub-column access to signal to map_rows that we’re applying mean to the “flux” sub-column of the “lightcurve” column. Let’s look at a more complex example:
[13]:
# This time our function takes the whole row as input
def flux_ranges(row, single_band=False):
    # Extract the relevant columns from the row dictionary
    flux = row["lightcurve.flux"]
    band = row["lightcurve.band"]
    modifier = row["a"]  # use the "a" column as a modifier
    # Calculate the flux range in a single band with a modifier
    flux = flux + modifier
    # filter to just a single band
    if single_band:
        mask = band == single_band
        flux = flux[mask]
    # Returning a dictionary will apply the keys as column names
    # Alternatively, supply output_names argument to map_rows
    return {"min_flux": flux.min(), "max_flux": flux.max()}
# We use the "a" column as a modifier
nf.map_rows(flux_ranges, single_band="g")
[13]:
| | min_flux | max_flux |
|---|---|---|
| 1 | 2.302449 | 95.669250 |
| 4 | 27.139545 | 99.879041 |
| 0 | 10.650465 | 90.757214 |
| 2 | 13.927749 | 96.484119 |
| 3 | 41.720260 | 94.761808 |
5 rows × 2 columns
flux_ranges is still a relatively simple function, but it exemplifies some of the important aspects of map_rows:
We see that we were able to pass both nested sub-columns (“lightcurve.flux”, “lightcurve.band”) as well as a base column (“a”). In this case, “a” is used to add a flat offset to the flux array.
The function returns a dictionary, with string keys matched to the resulting data. This is useful, as it allows column names to be generated for the output.
Our flux_ranges function has a kwarg that is not sourced from our NestedFrame. In this case, we passed it at the end of the map_rows call using kwarg syntax (single_band='g'). Any arguments (not just kwargs) that are not sourced from the NestedFrame must be passed in this way.
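The three points above can be illustrated with a small stand-alone emulation. This is a conceptual sketch only: `emulate_map_rows` and `toy_rows` are hypothetical names introduced here, and the loop is not how nested-pandas implements map_rows. It only shows the calling convention, where each row arrives as a dictionary of base values and sub-column arrays, extra kwargs are forwarded to the function, and returned dictionary keys become output column names.

```python
import numpy as np
import pandas as pd


def flux_ranges(row, single_band=None):
    """Same idea as the tutorial function, restated so this block is self-contained."""
    flux = row["lightcurve.flux"] + row["a"]  # offset by the base column "a"
    if single_band is not None:
        flux = flux[row["lightcurve.band"] == single_band]
    return {"min_flux": flux.min(), "max_flux": flux.max()}


# Hypothetical stand-in for two NestedFrame rows, keyed by index label
toy_rows = {
    0: {
        "a": 0.5,
        "lightcurve.flux": np.array([1.0, 4.0, 2.0]),
        "lightcurve.band": np.array(["g", "r", "g"]),
    },
    1: {
        "a": 1.0,
        "lightcurve.flux": np.array([10.0, 20.0]),
        "lightcurve.band": np.array(["r", "g"]),
    },
}


def emulate_map_rows(rows, func, **kwargs):
    """Illustrative only: call func once per row, forwarding extra kwargs."""
    results = {label: func(row, **kwargs) for label, row in rows.items()}
    # Dictionary keys returned by func become the output column names
    return pd.DataFrame.from_dict(results, orient="index")


summary = emulate_map_rows(toy_rows, flux_ranges, single_band="g")
```

Here `summary` has one row per input row and the columns `min_flux` and `max_flux`, mirroring the shape of the map_rows output shown earlier.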
The nest Accessor#
Another NestedFrame aspect to touch on is the .nest accessor:
[14]:
nf.lightcurve.nest
[14]:
<nested_pandas.series.accessor.NestSeriesAccessor at 0x7e5bd37a65a0>
This accessor object can be called on any nested column and has a small suite of additional functions used for viewing and transforming nested data. For example, we can programmatically view the available sub-columns through the accessor:
[ ]:
nf.lightcurve.nest.columns
['t', 'flux', 'band', 'flux_err']
We can use the accessor to grab different views of our nested data:
[16]:
# View a flat version of the nested data
# All sub-dataframes are brought to one level
nf.lightcurve.nest.to_flat()
[16]:
| | t | flux | band | flux_err |
|---|---|---|---|---|
| 1 | 13.70439 | 41.405599 | g | 2.07028 |
| 1 | 8.346096 | 66.379465 | r | 3.318973 |
| ... | ... | ... | ... | ... |
| 3 | 5.310933 | 35.726976 | r | 1.786349 |
| 3 | 11.786111 | 69.089692 | g | 3.454485 |
50 rows × 4 columns
[17]:
# Or view the data as lists
nf.lightcurve.nest.to_lists()
[17]:
| | t | flux | band | flux_err |
|---|---|---|---|---|
| 1 | [13.70439001 8.34609605 19.36523151 1.700884... | [41.40559878 66.37946452 13.74747041 92.750858... | ['g' 'r' 'r' 'g' 'r' 'g' 'g' 'r' 'r' 'g'] | [2.07027994 3.31897323 0.68737352 4.6375429 3... |
| 4 | [ 0.54775186 3.96202978 17.52778305 17.562850... | [ 4.99534589 58.65550405 39.7676837 72.599798... | ['r' 'r' 'r' 'g' 'g' 'g' 'g' 'g' 'r' 'g'] | [0.24976729 2.9327752 1.98838418 3.62998993 1... |
| 0 | [ 8.38389029 13.4093502 16.01489137 17.892133... | [10.23344288 53.58964059 90.34019153 16.535419... | ['g' 'g' 'g' 'g' 'g' 'r' 'g' 'g' 'g' 'r'] | [0.51167214 2.67948203 4.51700958 0.82677099 4... |
| 2 | [ 4.08904499 11.17379657 6.26848356 0.781095... | [69.44001577 51.48891121 13.92763473 34.776585... | ['g' 'r' 'g' 'r' 'g' 'g' 'r' 'r' 'r' 'g'] | [3.47200079 2.57444556 0.69638174 1.7388293 3... |
| 3 | [17.56234873 2.80773877 13.84645231 3.396608... | [41.41792695 94.4594756 80.73912887 75.081210... | ['g' 'g' 'g' 'g' 'r' 'g' 'r' 'r' 'r' 'g'] | [2.07089635 4.72297378 4.03695644 3.75406052 1... |
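The relationship between the flat view and the list view can be sketched with plain pandas. The example below uses hypothetical toy data and standard pandas calls (`groupby(...).agg(list)` and `DataFrame.explode`); it illustrates that the two views hold the same information in different layouts and is not the accessor's implementation.

```python
import pandas as pd

# Hypothetical flat photometry: the repeated index marks the parent row
flat = pd.DataFrame(
    {"t": [1.0, 2.0, 3.0, 4.0, 5.0], "flux": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=[0, 0, 0, 1, 1],
)

# List view: one row per parent, each cell holding that parent's values
lists = flat.groupby(level=0).agg(list)

# Back to the flat view: explode every list column together
round_trip = lists.explode(["t", "flux"]).astype(float)
```

After the round trip, `round_trip` has the same index and values as `flat`.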
3. Turning Data into Nested Columns#
In many data products, the relevant data will already be packaged as nested columns. However, it’s very possible your data will be provided in other formats that you would like to convert into nested columns. Let’s walk through a few common examples.
Converting from List Columns#
[18]:
import nested_pandas as npd
nf_lists = npd.NestedFrame(
    {
        "a": [10, 20, 30],
        "b": ["red", "blue", "green"],
        "c": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "d": [[11, 22, 33], [44, 55, 66], [77, 88, 99]],
    }
)
nf_lists
[18]:
| | a | b | c | d |
|---|---|---|---|---|
| 0 | 10 | red | [1, 2, 3] | [11, 22, 33] |
| 1 | 20 | blue | [4, 5, 6] | [44, 55, 66] |
| 2 | 30 | green | [7, 8, 9] | [77, 88, 99] |
3 rows × 4 columns
In this case, “c” and “d” are list columns. Because the lengths of “c” and “d” match for each row, they can be nested together. We can achieve this with the nest_lists function:
[19]:
nf_lists.nest_lists(name="packed", columns=["c", "d"])
[19]:
| | a | b | packed |
|---|---|---|---|
| 0 | 10 | red | (nested DataFrame) |
| 1 | 20 | blue | (nested DataFrame) |
| 2 | 30 | green | (nested DataFrame) |
Converting from Flat Tables#
Another case may be that your data is available as two separate tables:
[20]:
import nested_pandas as npd
nf_base = npd.NestedFrame({"a": [10, 20, 30], "b": ["red", "blue", "green"]})
nf_flat = npd.NestedFrame(
    {"c": [1, 2, 3, 4, 5, 6, 7, 8, 9], "d": [11, 22, 33, 44, 55, 66, 77, 88, 99]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)
nf_base
[20]:
| | a | b |
|---|---|---|
| 0 | 10 | red |
| 1 | 20 | blue |
| 2 | 30 | green |
3 rows × 2 columns
[21]:
nf_flat
[21]:
| | c | d |
|---|---|---|
| 0 | 1 | 11 |
| 0 | 2 | 22 |
| 0 | 3 | 33 |
| 1 | 4 | 44 |
| 1 | 5 | 55 |
| 1 | 6 | 66 |
| 2 | 7 | 77 |
| 2 | 8 | 88 |
| 2 | 9 | 99 |
9 rows × 2 columns
In this case, nf_flat has a matched index with nf_base, and we can easily nest it using join_nested:
[ ]:
nf_nested = nf_base.join_nested(nf_flat, name="packed")
nf_nested
| | a | b | packed |
|---|---|---|---|
| 0 | 10 | red | (nested DataFrame) |
| 1 | 20 | blue | (nested DataFrame) |
| 2 | 30 | green | (nested DataFrame) |
If the index is not matched, you can set the “on” kwarg to a shared column between the two DataFrames to treat the operation more as a join.
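As an alternative to the “on” kwarg, the two tables can be given a matched index up front by moving the shared column into the index. The pandas-only sketch below assumes a hypothetical shared column named `obj_id`; it shows the preparation step and a quick check for flat rows that have no parent, and does not describe what join_nested does internally.

```python
import pandas as pd

# Hypothetical tables that share an "obj_id" column instead of an index
base = pd.DataFrame({"obj_id": [101, 102], "b": ["red", "blue"]})
flat = pd.DataFrame({"obj_id": [101, 101, 102], "c": [1, 2, 3]})

# Move the shared column into the index so both tables are aligned
base_indexed = base.set_index("obj_id")
flat_indexed = flat.set_index("obj_id")

# Flat rows whose label is missing from the base table would have no parent
orphans = flat_indexed.index.difference(base_indexed.index)
```

An empty `orphans` index indicates that every flat row has a matching base row.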
This notebook touches on a few of the most common concepts of the NestedFrame. While this notebook works directly with the NestedFrame, operations on the LSDB catalog largely work the same. However, minor differences do exist, as certain functions may not be available in the LSDB catalog API, and arguments may differ depending on LSDB’s ability to support a given operation. You’re encouraged to consult the docs for a specific function whenever you aren’t sure of its behavior. For more information on NestedFrames, and nested-pandas in general, consult the nested-pandas documentation.
About#
Authors: Doug Branton
Last updated/verified on: October 27, 2025
If you use lsdb for published research, please cite it following these instructions.