Description
Summary
Implement the nbytes property for:
- ArkoudaStringArray
- ArkoudaCategoricalArray

to align with pandas ExtensionArray expectations.
Today these types either lack nbytes or return values that don't
reflect the underlying storage. pandas and downstream tooling use
nbytes for memory reporting and debugging; missing/incorrect
implementations can lead to confusing diagnostics and inconsistent
behavior compared to pandas-native StringArray/Categorical.
Background / Why
Implement ArkoudaStringArray.nbytes and ArkoudaCategoricalArray.nbytes to align with pandas.
In pandas, nbytes represents the number of bytes consumed by the
array's data (and often key supporting buffers) in memory. It is used in:
- Series.memory_usage(deep=...)
- DataFrame.memory_usage(deep=...)
- debugging/performance profiling
- heuristics in some internal algorithms
For Arkouda-backed arrays, "memory" is primarily server-side, but we
still need a consistent and meaningful value for nbytes that:
- is stable and well-defined
- is comparable across Arkouda extension dtypes
- matches pandas expectations as closely as possible
- does not force materialization of full client-side buffers
Requirements / Semantics
General
- Provide @property def nbytes(self) -> int
- Must be fast (O(1) or close), avoiding large transfers to the client
- Should reflect the memory used by the underlying server-side representation
ArkoudaStringArray
nbytes should approximate the total bytes used by the string storage, including:
- the byte buffer for all string characters (or equivalent server representation)
- offsets/index buffer (if used)
- missing-value mask/buffer (if present)
If Arkouda already exposes an estimate (e.g., total bytes of string
data), use it directly rather than recomputing client-side.
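As a rough illustration of the accounting above, the sketch below sums the three buffers from metadata alone. StringStorageInfo and string_array_nbytes are hypothetical names, not Arkouda API; the 8-byte offsets and the one-byte-per-element mask are assumptions that would need to be checked against the real server representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringStorageInfo:
    """Hypothetical server-side metadata for a segmented string array."""

    num_strings: int          # number of elements in the array
    total_char_bytes: int     # size of the flat byte buffer, as reported by the server
    offset_itemsize: int = 8  # assumption: one int64 offset per string
    has_mask: bool = False    # assumption: optional one-byte-per-element mask


def string_array_nbytes(info: StringStorageInfo) -> int:
    """Estimate nbytes from metadata only; no string data is transferred."""
    total = info.total_char_bytes                     # character bytes
    total += info.num_strings * info.offset_itemsize  # offsets/index buffer
    if info.has_mask:
        total += info.num_strings                     # missing-value mask
    return total


# Three strings holding 12 bytes of characters: 12 + 3 * 8 = 36
example_nbytes = string_array_nbytes(StringStorageInfo(num_strings=3, total_char_bytes=12))
```

Because every input is a scalar that the server can report cheaply, the computation stays O(1) on the client.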
ArkoudaCategoricalArray
Categorical storage typically includes:
- codes buffer (int array)
- categories storage (often strings; could be numeric)
- missing-value representation (mask or -1 codes)

nbytes should include:
- bytes for codes
- bytes for categories (and their buffers if string categories)
- bytes for any mask buffer
Note: pandas Categorical.nbytes includes the codes and categories
sizes. Align with that notion, even if Arkouda's categories are stored
separately server-side.
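One possible way to express this sum is sketched below. CategoricalStorageInfo and categorical_array_nbytes are hypothetical names used only for illustration; the sketch assumes the sizes of the codes, categories, and mask buffers are each available as metadata without fetching the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoricalStorageInfo:
    """Hypothetical server-side metadata for a categorical array."""

    num_codes: int          # number of elements (one code per element)
    codes_itemsize: int     # bytes per code, e.g. 8 for int64 codes
    categories_nbytes: int  # bytes for the categories, including string buffers
    mask_nbytes: int = 0    # zero when missing values are encoded as -1 codes


def categorical_array_nbytes(info: CategoricalStorageInfo) -> int:
    """Sum codes, categories, and mask bytes without materializing any buffer."""
    codes_nbytes = info.num_codes * info.codes_itemsize
    return codes_nbytes + info.categories_nbytes + info.mask_nbytes


# 1000 int64 codes plus 36 bytes of categories: 1000 * 8 + 36 = 8036
example_nbytes = categorical_array_nbytes(
    CategoricalStorageInfo(num_codes=1000, codes_itemsize=8, categories_nbytes=36)
)
```

When the categories are strings, categories_nbytes would itself be the string-storage estimate (characters plus offsets), so the two definitions compose.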
Scope
In Scope
- Add nbytes property to ArkoudaStringArray
- Add nbytes property to ArkoudaCategoricalArray
- Decide and document a consistent definition for "bytes" in the Arkouda context (server-side bytes vs client-side metadata)
- Add tests validating:
  - property exists
  - returns an int
  - non-negative
  - behaves monotonically (adding elements increases nbytes)
  - does not require NumPy materialization
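The test checklist above could be sketched as a single reusable check. MetadataOnlyArray is a hypothetical stand-in so the example runs without an Arkouda server; in the real suite the factory would construct ArkoudaStringArray or ArkoudaCategoricalArray, and materialization would more likely be detected by patching the conversion method.

```python
class MetadataOnlyArray:
    """Hypothetical stand-in for an Arkouda-backed extension array."""

    def __init__(self, values):
        # Only sizes are kept, mimicking metadata reported by a server.
        self._size = len(values)
        self._char_bytes = sum(len(v.encode("utf-8")) for v in values)
        self.materialized = False

    @property
    def nbytes(self) -> int:
        # Character bytes plus an assumed 8-byte offset per element.
        return self._char_bytes + 8 * self._size

    def to_numpy(self):
        # Record the call so a test can detect client-side materialization.
        self.materialized = True
        return list(range(self._size))


def check_nbytes_contract(make_array) -> bool:
    """Run the checklist against arrays built by `make_array`."""
    small = make_array(["a", "bb"])
    large = make_array(["a", "bb", "ccc"])

    assert hasattr(small, "nbytes")           # property exists
    assert isinstance(small.nbytes, int)      # returns an int
    assert small.nbytes >= 0                  # non-negative
    assert large.nbytes > small.nbytes        # adding elements increases nbytes
    assert not small.materialized             # no NumPy materialization
    assert not large.materialized
    return True


contract_ok = check_nbytes_contract(MetadataOnlyArray)
```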
Out of Scope
- Perfect byte-accurate accounting across all Arkouda server internals
- Implementing pandas memory_usage(deep=True) behavior beyond nbytes
- Cross-process cluster memory reporting