
Implement _reduce for ArkoudaStringArray and ArkoudaCategoricalArray (pandas-aligned) #5433

@ajpotts

Summary

Implement the pandas ExtensionArray hook _reduce for the following
arrays, to align with pandas behavior and avoid fallbacks that
materialize data as NumPy/object arrays:

  • ArkoudaStringArray
  • ArkoudaCategoricalArray

pandas uses _reduce to implement reductions like min, max, any,
all, and sometimes sum/prod depending on dtype. Without a correct
_reduce, pandas may:

  • raise TypeError/NotImplementedError
  • fall back to object arrays
  • compute reductions client-side
  • lose dtype semantics (especially around missing values)


Background / Why

pandas reduction operations on Series/Index often route through the
ExtensionArray API:

  • Series.min/max/any/all
  • Index.min/max
  • internal reductions used by groupby/joins/algorithms

For Arkouda-backed arrays, we want:

  • correct pandas-compatible semantics
  • correct missing-value handling (skipna)
  • server-side execution where possible
  • predictable error behavior for unsupported reductions


pandas _reduce Contract (high-level)

Signature (pandas-private, may vary by version):

def _reduce(self, name: str, *, skipna: bool = True, keepdims: bool = False, **kwargs):
    ...

Behavior:

  • name selects the reduction (e.g., "min", "max", "any", "all",
    "sum", "prod", etc.)
  • returns a scalar (or 1D array if keepdims=True)
  • skipna controls missing-value handling
  • should raise for unsupported reductions/dtypes consistently with
    pandas

This ticket should follow the contract pandas expects for the version(s)
Arkouda supports.
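The contract above can be sketched as a name-based dispatch. This is a minimal illustration in plain Python, not the Arkouda implementation: SketchArray is a hypothetical stand-in class, None stands in for the missing value, and a Python list stands in for the length-1 array that keepdims=True would return.

```python
from typing import Any, Callable, Dict, List


class SketchArray:
    """Hypothetical stand-in for an Arkouda-backed extension array."""

    # Supported reductions; any name not listed here is rejected.
    _SUPPORTED: Dict[str, Callable[[List[Any]], Any]] = {
        "min": min,
        "max": max,
    }

    def __init__(self, values: List[Any]):
        self._values = list(values)

    def _reduce(self, name: str, *, skipna: bool = True,
                keepdims: bool = False, **kwargs):
        try:
            func = self._SUPPORTED[name]
        except KeyError:
            # Assumption: unsupported reductions raise TypeError.
            raise TypeError(
                f"cannot perform {name!r} with type {type(self).__name__}"
            ) from None

        # skipna=True drops missing values before reducing.
        values = ([v for v in self._values if v is not None]
                  if skipna else self._values)

        if not values or any(v is None for v in values):
            # Nothing left to reduce, or a missing value propagates.
            result = None
        else:
            result = func(values)

        # Assumption: a list models the length-1 array result.
        return [result] if keepdims else result
```

Keeping the supported names in one table makes the error path uniform: every unsupported reduction fails the same way, which is what makes the behavior predictable for callers.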


Expected Semantics

Strings

Typical pandas expectations for string reductions:

  • min / max: lexicographic min/max over non-missing values
  • any / all: typically not meaningful for strings; pandas may raise
    TypeError (confirm baseline behavior and match)
  • sum / prod: not supported; should raise
    TypeError/NotImplementedError

Missing value handling:

  • skipna=True:
    • ignore missing values
    • if all values missing → result is missing (often pd.NA)
  • skipna=False:
    • if any missing present → result is missing

Edge cases:

  • empty array: match pandas (often raises or returns missing
    depending on op)
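The string rules above can be written down as a small reference function. This is a client-side sketch for reasoning about semantics only; reduce_strings is a hypothetical name, None stands in for pd.NA, and returning missing for an empty input is an assumption that still has to be checked against the pandas baseline.

```python
from typing import List, Optional


def reduce_strings(values: List[Optional[str]], name: str,
                   skipna: bool = True) -> Optional[str]:
    """Reference semantics for string min/max with missing values."""
    if name not in ("min", "max"):
        # Assumption: any/all/sum/prod are rejected with TypeError.
        raise TypeError(f"reduction {name!r} is not supported for strings")

    has_missing = any(v is None for v in values)
    if has_missing and not skipna:
        # skipna=False: a single missing value makes the result missing.
        return None

    present = [v for v in values if v is not None]
    if not present:
        # All values missing, or empty input (assumption: returns missing).
        return None

    # Python compares str lexicographically by code point.
    return min(present) if name == "min" else max(present)
```

For example, reduce_strings(["b", None, "a"], "min") returns "a", while the same call with skipna=False returns None.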

Categoricals

Categorical reductions in pandas are constrained:

  • min / max: supported if categories are ordered (and maybe for
    unordered in some cases?); confirm pandas baseline and match exactly
  • any / all: likely unsupported (confirm and match)
  • sum / prod: unsupported

Missing value handling:

  • same skipna behavior as above (ignore vs propagate)

Metadata:

  • If reduction returns a category value, return the scalar category
    label (not the code), consistent with pandas.
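For categoricals, the reduction runs over integer codes and the result is mapped back to a label. The sketch below assumes the pandas convention of code -1 for missing values and assumes that unordered categoricals raise TypeError, which the ticket still asks to confirm; reduce_categorical and MISSING_CODE are hypothetical names.

```python
from typing import List, Optional, Sequence

MISSING_CODE = -1  # pandas marks missing categorical entries with code -1


def reduce_categorical(codes: List[int], categories: Sequence[str],
                       ordered: bool, name: str,
                       skipna: bool = True) -> Optional[str]:
    """Reference semantics for categorical min/max over integer codes."""
    if name not in ("min", "max"):
        raise TypeError(
            f"reduction {name!r} is not supported for categoricals")
    if not ordered:
        # Assumption: min/max require ordered categories.
        raise TypeError(f"categorical is not ordered for operation {name!r}")

    has_missing = any(c == MISSING_CODE for c in codes)
    if has_missing and not skipna:
        return None  # missing value propagates

    present = [c for c in codes if c != MISSING_CODE]
    if not present:
        return None  # assumption: all-missing or empty returns missing

    # Codes follow category order, so min/max on codes respects it.
    code = min(present) if name == "min" else max(present)
    return categories[code]  # return the label, not the code
```

With categories ["low", "mid", "high"] and codes [2, -1, 0, 1], the minimum is "low" and the maximum is "high", which differs from a lexicographic comparison of the labels.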


Scope

In Scope

  • Implement _reduce for both arrays with signature compatible with
    pandas usage
  • Support at minimum:
    • Strings: min, max
    • Categoricals: min, max (with ordered semantics matching
      pandas)
  • Correctly implement skipna
  • Implement keepdims behavior if pandas calls it (return a length-1
    array when keepdims=True, otherwise a scalar)
  • Add unit tests comparing Arkouda dtype reductions to pandas
    baselines
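One possible shape for the baseline-comparison tests is a helper that treats "returns a value" and "raises an exception type" as comparable outcomes. This is a sketch with hypothetical names (outcome, assert_same_outcome). In a real suite the baseline callable would wrap the pandas reduction and the candidate would wrap the Arkouda-backed one.

```python
from typing import Any, Callable, Tuple


def outcome(func: Callable[[], Any]) -> Tuple[str, Any]:
    """Run func and describe what happened: a value or an exception type."""
    try:
        return ("value", func())
    except Exception as exc:  # broad on purpose: the type is the result
        return ("raises", type(exc))


def assert_same_outcome(candidate: Callable[[], Any],
                        baseline: Callable[[], Any]) -> None:
    """Fail unless both callables return equal values or raise the same type."""
    got = outcome(candidate)
    expected = outcome(baseline)
    assert got == expected, f"candidate {got!r} != baseline {expected!r}"


# Usage: both sides agree on a value, or both raise the same exception type.
assert_same_outcome(lambda: min(["b", "a"]), lambda: "a")
assert_same_outcome(lambda: min([]), lambda: max([]))
```

Comparing exception types as well as values covers the "predictable error behavior" requirement with the same helper. Missing results such as pd.NA or NaN do not compare equal to themselves, so a real version would need a pd.isna check before the equality comparison.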

Out of Scope

  • Full support for every reduction name if pandas doesn't require it
    for these dtypes
  • Groupby reductions (pandas orchestrates, but may call _reduce on
    chunks)
  • Performance tuning beyond avoiding obvious fallbacks

Labels: enhancement