Implement equals for ArkoudaStringArray and ArkoudaCategoricalArray (pandas-aligned) #5432

@ajpotts

Summary

Implement the equals method for:

  • ArkoudaStringArray
  • ArkoudaCategoricalArray

to match pandas ExtensionArray semantics.

pandas relies on .equals() for correctness checks, testing utilities,
alignment logic, and some internal fast-path decisions. Missing or
incorrect implementations can cause false negatives/positives in
comparisons and may trigger slow fallbacks (e.g., converting to
object/NumPy).


Background / Why

In pandas, ExtensionArray.equals(other) answers:

Are these two arrays the same length and do they contain equal
elements in the same positions, treating missing values as equal to
missing values?

Key points:

  • This is not elementwise comparison (==); it returns a single
    boolean.
  • Missing values compare equal only when both are missing in the
    same positions.
  • For categoricals, equality also depends on dtype metadata
    (categories/order).

This method is used in:

  • pandas tests/assertions (tm.assert_* helpers)
  • Series.equals, Index.equals
  • some optimization checks (e.g., short-circuiting operations)


Expected pandas Semantics

Strings

Two arrays are equal if:

  • Same length
  • For each position:
    • both missing → equal
    • both non-missing and strings equal → equal
    • otherwise not equal

Examples:

  • ["a", None, "b"] equals ["a", None, "b"] → True
  • ["a", None, "b"] equals ["a", "x", "b"] → False
  • ["a", None] equals ["a"] → False
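
The string rules above can be written down as a small reference function. This is only a sketch of the intended semantics on plain Python lists, with None standing in for a missing value; the name strings_equal is hypothetical and it is not how ArkoudaStringArray would compute the result on the server.

```python
from typing import Optional, Sequence


def strings_equal(
    left: Sequence[Optional[str]], right: Sequence[Optional[str]]
) -> bool:
    """Reference semantics only: None stands in for a missing value."""
    # Arrays of different lengths are never equal.
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if a is None or b is None:
            # Missing matches only missing, at the same position.
            if a is not b:
                return False
        elif a != b:
            return False
    return True
```

A function like this could serve as a slow oracle in unit tests, next to the pandas baseline.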

Categoricals

Two categoricals are equal if:

  • Same length
  • Same dtype metadata (pandas behavior requires):
    • same categories (typically same values and same order)
    • same ordered flag
  • The codes (including missing) match positionally

Examples:

  • Categorical(["a", None], categories=["a","b"]) equals a categorical
    with the same dtype and same values → True
  • Same values but different categories order → False (pandas treats
    dtype mismatch as not equal)
  • Same categories but different ordered flag → False

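Note: The category-order rule above still needs confirming. pandas
documents that two unordered CategoricalDtype instances with the same
categories compare equal regardless of category order, so
Categorical.equals may return True in that case. Whatever pandas does
for Categorical.equals, we should match it exactly.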
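
A minimal sketch of the categorical rules, assuming a categorical is represented by integer codes plus dtype metadata, with -1 as the missing code (the convention pandas uses). CatSketch, decode, categoricals_equal, and the order_sensitive flag are all hypothetical names; the flag exists only to make the open question in the note explicit, and its default follows the rule as listed in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MISSING_CODE = -1  # pandas uses code -1 for a missing categorical value


@dataclass(frozen=True)
class CatSketch:
    """Hypothetical stand-in: integer codes plus dtype metadata."""

    codes: Tuple[int, ...]
    categories: Tuple[str, ...]
    ordered: bool = False


def decode(cat: CatSketch) -> List[Optional[str]]:
    """Map codes back to category values; the missing code becomes None."""
    return [None if c == MISSING_CODE else cat.categories[c] for c in cat.codes]


def categoricals_equal(
    left: CatSketch, right: CatSketch, order_sensitive: bool = True
) -> bool:
    # Cheap metadata checks first.
    if len(left.codes) != len(right.codes):
        return False
    if left.ordered != right.ordered:
        return False
    if left.categories == right.categories:
        # Same dtype: codes (including missing) must match positionally.
        return left.codes == right.codes
    # Categories differ as sequences. For ordered data, or under the strict
    # rule listed in this issue, that is a dtype mismatch.
    if order_sensitive or left.ordered:
        return False
    if set(left.categories) != set(right.categories):
        return False
    # Assumption: same category set in a different order, so compare values.
    return decode(left) == decode(right)
```

Once the pandas behavior is confirmed, the flag should disappear and the sketch should follow whichever branch matches Categorical.equals.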


Scope

In Scope

  • Implement:
    • ArkoudaStringArray.equals(other) -> bool
    • ArkoudaCategoricalArray.equals(other) -> bool
  • Accept other as:
    • same Arkouda array type
    • pandas equivalent array type where reasonable (e.g., pandas
      StringArray/Categorical)
    • array-like (optional; if not supported, return False)
  • Ensure missing-value semantics match pandas
  • Avoid full materialization for large arrays (no .to_numpy() of
    full data)
  • Add unit tests comparing to pandas baselines
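
The order of checks implied by the list above might look like the following sketch. StringArraySketch is a hypothetical stand-in that keeps its data in a Python list; a real implementation would run the comparison on the Arkouda server as a reduction rather than iterating locally. The sketch only illustrates the type gate, the early length check, and stopping at the first mismatch.

```python
from typing import Optional, Sequence


class StringArraySketch:
    """Hypothetical stand-in for ArkoudaStringArray; data is a local list."""

    def __init__(self, values: Sequence[Optional[str]]):
        self._values = list(values)

    def __len__(self) -> int:
        return len(self._values)

    def equals(self, other: object) -> bool:
        # Unsupported types return False instead of raising or coercing.
        if isinstance(other, StringArraySketch):
            other_values = other._values
        elif isinstance(other, (list, tuple)):
            other_values = other  # optional array-like support
        else:
            return False
        # Cheap length check before touching any element.
        if len(self._values) != len(other_values):
            return False
        # all() over a generator stops at the first mismatch.
        return all(
            (a is None and b is None)
            or (a is not None and b is not None and a == b)
            for a, b in zip(self._values, other_values)
        )
```

For example, StringArraySketch(["a", None]).equals(42) returns False rather than raising, which matches the "if not supported, return False" item.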

Out of Scope

  • Elementwise comparisons (==), handled elsewhere
  • Cross-dtype "coercive" equality (should generally return False)
