nest_lists
- Catalog.nest_lists(base_columns: list[str] | None = None, list_columns: list[str] | None = None, name: str = 'nested') -> Catalog
Creates a new catalog with a set of list columns packed into a nested column.
- Parameters:
- base_columns : list-like or None
Any columns that have non-list values in the input catalog. These are kept as identical columns in the result. If None, this is inferred to be all columns in the input catalog that are not considered list-value columns.
- list_columns : list-like or None
The list-value columns that should be packed into a nested column. The operation attempts to pack all columns in this list into a single nested column with the name given by the name argument (default 'nested'). All columns in list_columns must have pyarrow list dtypes, otherwise the operation will fail. If None, this is defined as all columns not in base_columns.
- Returns:
- Catalog
A new catalog with the specified list columns packed into a new nested column.
Notes
As noted above, all columns in list_columns must have a pyarrow ListType dtype. This is needed for proper meta propagation. To convert a list column to this dtype, you can use this command structure:
nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))})
Here, pa.int64 should be replaced with the dtype of the underlying data. Additionally, it’s a known issue in Dask (dask/dask#10139) that columns with list values are converted to the string type by default. This interferes with recasting these columns to pyarrow lists. We recommend setting the following dask config setting to prevent this:
dask.config.set({"dataframe.convert-string": False})