Description
Summary
Implement the pandas ExtensionArray hook _reduce for:
- ArkoudaStringArray
- ArkoudaCategoricalArray

to align with pandas behavior and avoid fallbacks that materialize data
as NumPy/object.

pandas uses _reduce to implement reductions like min, max, any,
all, and sometimes sum/prod depending on dtype. Without a correct
_reduce, pandas may:
- raise TypeError/NotImplementedError
- fall back to object arrays
- compute reductions client-side
- lose dtype semantics (especially around missing values)
Background / Why
pandas reduction operations on Series/Index often route through the
ExtensionArray API:
- Series.min/max/any/all
- Index.min/max
- internal reductions used by groupby/joins/algorithms

For Arkouda-backed arrays, we want:
- correct pandas-compatible semantics
- correct missing-value handling (skipna)
- server-side execution where possible
- predictable error behavior for unsupported reductions
pandas _reduce Contract (high-level)
Signature (pandas-private, may vary by version):
def _reduce(self, name: str, skipna: bool = True, keepdims: bool = False, **kwargs):
    ...

Behavior:
- name selects the reduction (e.g., "min", "max", "any", "all", "sum", "prod", etc.)
- returns a scalar (or 1D array if keepdims=True)
- skipna controls missing-value handling
- should raise for unsupported reductions/dtypes consistently with pandas

This ticket should follow the contract pandas expects for the version(s)
Arkouda supports.
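As a rough illustration of the contract above, the following sketch shows only the dispatch shape: select a reducer by name, raise TypeError for anything unsupported, and wrap the result when keepdims=True. ToyArray and its _REDUCERS table are hypothetical names invented for this example; they are not Arkouda or pandas classes. Two simplifying assumptions: missing values are not modeled, and a plain list stands in for the length-1 array that a real implementation would return.

```python
from typing import Callable, Dict, Sequence


class ToyArray:
    """Hypothetical stand-in for an ExtensionArray (not an Arkouda or pandas class)."""

    # Reductions this toy dtype supports; any other name raises TypeError.
    _REDUCERS: Dict[str, Callable[[Sequence[str]], str]] = {"min": min, "max": max}

    def __init__(self, values: Sequence[str]) -> None:
        self._values = list(values)

    def _reduce(self, name: str, skipna: bool = True, keepdims: bool = False, **kwargs):
        reducer = self._REDUCERS.get(name)
        if reducer is None:
            # Error wording is illustrative; match the pandas baseline in real code.
            raise TypeError(f"cannot perform '{name}' with type {type(self).__name__}")
        # Missing values are not modeled in this sketch, so skipna is accepted but unused.
        result = reducer(self._values)
        # keepdims=True: return a length-1 container instead of a scalar.
        return [result] if keepdims else result


arr = ToyArray(["pear", "apple", "fig"])
print(arr._reduce("min"))
print(arr._reduce("max", keepdims=True))
```

Keeping the supported names in one table makes the "predictable error behavior" requirement easy to audit: a reduction is either listed or it raises.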
Expected Semantics
Strings
Typical pandas expectations for string reductions:
- min / max: lexicographic min/max over non-missing values
- any / all: typically not meaningful for strings; pandas may raise TypeError (confirm baseline behavior and match)
- sum / prod: not supported; should raise TypeError/NotImplementedError

Missing value handling:
- skipna=True:
  - ignore missing values
  - if all values missing → result is missing (often pd.NA)
- skipna=False:
  - if any missing present → result is missing

Edge cases:
- empty array: match pandas (often raises or returns missing depending on op)
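The string rules above can be sketched in plain Python. This is a client-side model of the intended semantics only, not the server-side implementation. reduce_strings and MISSING are hypothetical names; None stands in for pd.NA, Python's code-point string ordering stands in for "lexicographic", and returning missing for an empty input is an assumption that still needs to be checked against the pandas baseline.

```python
from typing import List, Optional

MISSING = None  # assumption: stands in for pd.NA in this sketch


def reduce_strings(values: List[Optional[str]], name: str, skipna: bool = True) -> Optional[str]:
    """Model min/max over strings with pandas-style skipna handling."""
    if name not in ("min", "max"):
        # any/all/sum/prod are treated as unsupported here; confirm against pandas.
        raise TypeError(f"reduction '{name}' is not supported for strings")

    present = [v for v in values if v is not MISSING]
    has_missing = len(present) != len(values)

    if not skipna and has_missing:
        return MISSING  # skipna=False: any missing value propagates
    if not present:
        return MISSING  # all missing, or empty input (assumed behavior)
    return min(present) if name == "min" else max(present)


print(reduce_strings(["b", None, "a"], "min"))
print(reduce_strings(["b", None, "a"], "max", skipna=False))
```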
Categoricals
Categorical reductions in pandas are constrained:
- min / max: supported if categories are ordered (and maybe for unordered in some cases?); confirm pandas baseline and match exactly
- any / all: likely unsupported (confirm and match)
- sum / prod: unsupported

Missing value handling:
- same skipna behavior as above (ignore vs propagate)

Metadata:
- If reduction returns a category value, return the scalar category label (not the code), consistent with pandas.
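One way to picture the categorical case is to reduce over integer codes and map the winning code back to its label. The sketch below assumes the pandas-style encoding in which code -1 marks a missing value and the position in categories defines the order; Arkouda's internal layout may differ. reduce_categorical is a hypothetical helper, None stands in for the missing result, and raising TypeError for unordered categories is the assumption made here pending the pandas baseline check.

```python
from typing import Optional, Sequence


def reduce_categorical(
    codes: Sequence[int],
    categories: Sequence[str],
    ordered: bool,
    name: str,
    skipna: bool = True,
) -> Optional[str]:
    """Model min/max on a categorical; returns the label, never the code."""
    if name not in ("min", "max"):
        raise TypeError(f"reduction '{name}' is not supported for categoricals")
    if not ordered:
        # Assumed behavior for unordered categories; wording is illustrative.
        raise TypeError(f"categorical is not ordered for operation '{name}'")

    present = [c for c in codes if c != -1]  # -1 marks missing (assumed encoding)
    if not skipna and len(present) != len(codes):
        return None  # missing value propagates
    if not present:
        return None  # all missing or empty (assumed behavior)

    code = min(present) if name == "min" else max(present)
    return categories[code]  # map the code back to its category label


cats = ["low", "mid", "high"]  # order is by position, not alphabetical
print(reduce_categorical([2, -1, 0, 1], cats, ordered=True, name="max"))
```

The example categories are chosen so that positional order and alphabetical order disagree: the maximum by position is "high", whereas a string comparison of the labels would pick "mid".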
Scope
In Scope
- Implement _reduce for both arrays with signature compatible with pandas usage
- Support at minimum:
  - Strings: min, max
  - Categoricals: min, max (with ordered semantics matching pandas)
- Correctly implement skipna
- Implement keepdims behavior if pandas calls it (return length-1 array or scalar)
- Add unit tests comparing Arkouda dtype reductions to pandas baselines
Out of Scope
- Full support for every reduction name if pandas doesn't require it for these dtypes
- Groupby reductions (pandas orchestrates, but may call _reduce on chunks)
- Performance tuning beyond avoiding obvious fallbacks