Implement nbytes for ArkoudaStringArray and ArkoudaCategoricalArray (pandas-aligned) #5431

@ajpotts

Description

Summary

Implement the nbytes property for:

  • ArkoudaStringArray
  • ArkoudaCategoricalArray

to align with pandas ExtensionArray expectations.

Today these types either lack nbytes or return values that don't
reflect the underlying storage. pandas and downstream tooling use
nbytes for memory reporting and debugging; missing/incorrect
implementations can lead to confusing diagnostics and inconsistent
behavior compared to pandas-native StringArray/Categorical.


Background / Why

ArkoudaStringArray.nbytes and ArkoudaCategoricalArray.nbytes should align with pandas.
In pandas, nbytes represents the number of bytes consumed by the
array's data (and often key supporting buffers) in memory. It is used
in:

  • Series.memory_usage(deep=...)
  • DataFrame.memory_usage(deep=...)
  • debugging/performance profiling
  • heuristics in some internal algorithms
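
For reference, here is a small sketch of how pandas itself reports nbytes and how it feeds memory_usage. It uses only pandas-native types (no Arkouda), and the variable names are illustrative.

```python
import pandas as pd

# A NumPy-backed Series: nbytes is just element count * item size.
s = pd.Series([1, 2, 3], dtype="int64")
int_bytes = s.array.nbytes                   # 3 elements * 8 bytes each
series_bytes = s.memory_usage(index=False)   # same figure, index excluded

# A pandas-native Categorical: nbytes is codes plus categories.
cat = pd.Categorical(["a", "b", "a"])
cat_bytes = cat.nbytes
cat_parts = cat.codes.nbytes + cat.categories.values.nbytes

print(int_bytes, series_bytes, cat_bytes == cat_parts)
```

The Categorical comparison is the behavior the Arkouda-backed categorical array would need to mirror: both the codes and the categories contribute to the total.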

For Arkouda-backed arrays, "memory" is primarily server-side, but we
still need a consistent and meaningful value for nbytes that:

  • is stable and well-defined
  • is comparable across Arkouda extension dtypes
  • matches pandas expectations as closely as possible
  • does not force materialization of full client-side buffers

Requirements / Semantics

General

  • Provide @property def nbytes(self) -> int
  • Must be fast (O(1) or close), avoiding large transfers to the client
  • Should reflect the memory used by the underlying server-side
    representation

ArkoudaStringArray

nbytes should approximate the total bytes used by the string storage,
including:

  • the byte buffer for all string characters (or equivalent server
    representation)
  • offsets/index buffer (if used)
  • missing-value mask/buffer (if present)

If Arkouda already exposes an estimate (e.g., total bytes of string
data), use it directly rather than recomputing client-side.
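
One way to read the accounting above, as a minimal sketch rather than the actual implementation: sum the sizes of the component buffers using only metadata (element count and item size), so nothing is transferred to the client. BufferInfo and string_array_nbytes are hypothetical names introduced for illustration, and the null-terminated byte layout with 8-byte offsets is an assumption about the server representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BufferInfo:
    """Hypothetical stand-in for metadata describing one server-side buffer."""
    size: int      # number of elements in the buffer
    itemsize: int  # bytes per element

    @property
    def nbytes(self) -> int:
        return self.size * self.itemsize


def string_array_nbytes(
    values: BufferInfo,
    offsets: BufferInfo,
    mask: Optional[BufferInfo] = None,
) -> int:
    """Sum character bytes, offsets, and (if present) the missing-value mask."""
    total = values.nbytes + offsets.nbytes
    if mask is not None:
        total += mask.nbytes
    return int(total)


# Assumed layout for ["a", "bc", "def"]: UTF-8 bytes plus one null
# terminator per string (2 + 3 + 4 = 9 bytes) and three 8-byte offsets.
values = BufferInfo(size=9, itemsize=1)
offsets = BufferInfo(size=3, itemsize=8)
total = string_array_nbytes(values, offsets)  # 9 + 24 = 33
```

Because the function only touches sizes, it is O(1) regardless of how many strings live on the server.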

ArkoudaCategoricalArray

Categorical storage typically includes:

  • codes buffer (int array)
  • categories storage (often strings; could be numeric)
  • missing-value representation (mask or -1 codes)

nbytes should include:

  • bytes for codes
  • bytes for categories (and their buffers if string categories)
  • bytes for any mask buffer

Note: pandas Categorical.nbytes includes the codes and categories
sizes. Align with that notion, even if Arkouda's categories are stored
separately server-side.
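
A minimal sketch of the categorical accounting, assuming each component (codes, categories, optional mask) exposes an integer nbytes attribute. categorical_nbytes is a hypothetical helper, and the SimpleNamespace objects are stand-ins for server-side handles, not Arkouda types.

```python
from types import SimpleNamespace
from typing import Any, Optional


def categorical_nbytes(codes: Any, categories: Any, mask: Optional[Any] = None) -> int:
    """Sum the nbytes of each component; only size metadata is read."""
    parts = [codes, categories]
    if mask is not None:
        parts.append(mask)
    return int(sum(part.nbytes for part in parts))


# Stand-ins: five int64 codes and a string categories object that
# reports 33 bytes for its own character and offset buffers.
codes = SimpleNamespace(nbytes=5 * 8)
categories = SimpleNamespace(nbytes=33)
total = categorical_nbytes(codes, categories)  # 40 + 33 = 73
```

If missing values are encoded as -1 codes rather than a separate mask, no extra term is needed because the codes buffer already accounts for them.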


Scope

In Scope

  • Add nbytes property to ArkoudaStringArray
  • Add nbytes property to ArkoudaCategoricalArray
  • Decide and document a consistent definition for "bytes" in the
    Arkouda context (server-side bytes vs client-side metadata)
  • Add tests validating:
    • property exists
    • returns an int
    • non-negative
    • behaves monotonically (adding elements increases nbytes)
    • does not require NumPy materialization
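
The test checklist above could be sketched roughly as follows. FakeStringArray is a hypothetical stand-in so the example runs without an Arkouda server; a real test would build ArkoudaStringArray or ArkoudaCategoricalArray instances instead. The materialization check assumes the array class defines to_numpy, which is patched to fail if nbytes ever calls it.

```python
from typing import Callable, Sequence
from unittest import mock


class FakeStringArray:
    """Hypothetical stand-in for an Arkouda-backed string array."""

    def __init__(self, data: Sequence[str]) -> None:
        self._data = list(data)

    def to_numpy(self) -> list:
        return list(self._data)

    @property
    def nbytes(self) -> int:
        # Assumed layout: UTF-8 bytes + null terminator + 8-byte offset.
        return sum(len(s.encode("utf-8")) + 1 + 8 for s in self._data)


def check_nbytes_contract(make_array: Callable[[Sequence[str]], object]) -> None:
    small = make_array(["a", "bb"])
    large = make_array(["a", "bb", "ccc"])
    # Any call to to_numpy while reading nbytes raises AssertionError.
    with mock.patch.object(
        type(small), "to_numpy", side_effect=AssertionError("materialized")
    ):
        small_n, large_n = small.nbytes, large.nbytes
    assert isinstance(small_n, int)   # returns an int
    assert small_n >= 0               # non-negative
    assert large_n > small_n          # adding elements increases nbytes


check_nbytes_contract(FakeStringArray)
```

Passing a factory keeps the same contract check reusable for both extension array types.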

Out of Scope

  • Perfect byte-accurate accounting across all Arkouda server internals
  • Implementing pandas memory_usage(deep=True) behavior beyond
    nbytes
  • Cross-process cluster memory reporting

Metadata

  • Assignees: No one assigned
  • Labels: enhancement (New feature or request)