Skip to content

lib: create data_structures directory, implement hashset data structure#470

Draft
pzmarzly wants to merge 6 commits intofacebook:mainfrom
pzmarzly:push-mlqkpqspnupl
Draft

lib: create data_structures directory, implement hashset data structure#470
pzmarzly wants to merge 6 commits intofacebook:mainfrom
pzmarzly:push-mlqkpqspnupl

Conversation

@pzmarzly
Copy link
Contributor

WIP draft

See #460

A simple data hashing function.
@pzmarzly pzmarzly requested a review from qdeslandes as a code owner March 12, 2026 15:00
@pzmarzly pzmarzly marked this pull request as draft March 12, 2026 15:00
@meta-cla meta-cla bot added the cla signed label Mar 12, 2026

static uint64_t _bf_set_elem_hash(const void *data, void *ctx)
{
return bf_fnv1a(data, *(const size_t *)ctx, 0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: FNV-1a hash seeded with 0 instead of BF_FNV1A_INIT

The documentation for bf_fnv1a in helper.h explicitly states "Pass @ref BF_FNV1A_INIT as @p hash for the initial call." Using 0 instead of the FNV-1a offset basis (0xcbf29ce484222325ULL) degrades hash distribution quality, leading to more collisions in the hashset. The unit test for bf_fnv1a in tests/unit/libbpfilter/helper.c correctly uses BF_FNV1A_INIT.

Same issue in tests/unit/libbpfilter/data_structures/hashset.c line 14.

Suggested change
return bf_fnv1a(data, *(const size_t *)ctx, 0);
return bf_fnv1a(data, *(const size_t *)ctx, BF_FNV1A_INIT);

static uint64_t uint32_hash(const void *data, void *ctx)
{
(void)ctx;
return bf_fnv1a(data, sizeof(uint32_t), 0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: FNV-1a hash seeded with 0 instead of BF_FNV1A_INIT

Same issue as in set.c — should use BF_FNV1A_INIT to match the documented API contract and produce correct hash distribution.

Suggested change
return bf_fnv1a(data, sizeof(uint32_t), 0);
return bf_fnv1a(data, sizeof(uint32_t), BF_FNV1A_INIT);

*
* If the element already exists, no duplicate is added and 0 is returned.
* The hashset grows automatically when the load factor is exceeded.
* The hashset takes ownership of @p data.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: bf_hashset_add ownership semantics are ambiguous for duplicates

The doc says "The hashset takes ownership of @p data" but when the element already exists, bf_hashset_add returns 0 without storing data — ownership is silently not taken. This is a memory management footgun for callers using the idiomatic TAKE_PTR pattern after a successful add.

The current caller (bf_set_add_elem) works around this by pre-checking with bf_hashset_contains, but future callers may not. Consider either:

  1. Updating the doc to clarify: "On successful insertion, the hashset takes ownership. If the element already exists, ownership is not transferred."
  2. Or returning -EEXIST for duplicates so callers can distinguish the two cases.

Comment on lines 399 to 400

bf_dump_prefix_push(prefix);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: bf_set_dump lost bf_dump_prefix_last call for last element

The old bf_list_foreach loop called bf_dump_prefix_last(prefix) when iterating over the last element, producing correct tree-like connector characters in dump output. The new bf_set_foreach (backed by bf_hashset_foreach) has no way to detect the last element, so the final element's prefix will show a connector instead of . This is a consistent pattern elsewhere in the codebase (see bf_chain_dump, bf_rule_dump, bf_hook_dump).

*/

/// Sentinel pointer value for a tombstoned (deleted) slot.
#define _BF_HASHSET_TOMBSTONE ((void *)1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: _BF_HASHSET_TOMBSTONE is an implementation detail exposed in the public header

This sentinel value is only needed by bf_hashset_foreach's expansion. Consider moving it to the .c file and restructuring the foreach macro to call an inline helper that encapsulates the tombstone check, keeping this internal constant out of the public API surface.

@github-actions
Copy link

github-actions bot commented Mar 12, 2026

Claude: review of facebook/bpfilter #470 (02938a3)

Must fix

  • FNV-1a hash seeded with 0 instead of BF_FNV1A_INITsrc/libbpfilter/set.c:27, tests/unit/libbpfilter/data_structures/hashset.c:14 — Fixed: all calls now use BF_FNV1A_INIT.
  • Memory leak in bf_set_add_elem when bf_hashset_add fails during growsrc/libbpfilter/set.c:441TAKE_PTR(_elem) nullifies _elem before bf_hashset_add executes. On failure (e.g., -ENOMEM from _bf_hashset_grow), the allocated copy is leaked.

Suggestions

  • bf_hashset_add ownership semantics ambiguous for duplicatessrc/libbpfilter/include/bpfilter/data_structures/hashset.h:174 — Fixed: doc now clarifies -EEXIST return and that ownership is not transferred.
  • bf_set_dump lost bf_dump_prefix_last call for last elementsrc/libbpfilter/set.c:399-400 — Fixed: now uses _dump_n counter to detect last element.
  • assert() on non-pointer value in bf_vector_newsrc/libbpfilter/data_structures/vector.c:22assert(elem_size) validates a size_t. Style guide says assert is for pointer values only; use if with error return for other validation.
  • Duplicated slot-liveness check between header and implementationsrc/libbpfilter/include/bpfilter/data_structures/hashset.h:97_bf_hashset_slot_is_occupied (header) and _bf_slot_is_live (.c) are functionally identical. Consider consolidating.
  • bf_set_get_key_comp lacks Doxygen despite non-trivial preconditionsrc/libbpfilter/include/bpfilter/set.h:149 — Has an assert(index < n_comps) precondition that callers need to know about. Other accessors in the same header are documented.

Nits

  • _BF_HASHSET_TOMBSTONE exposed in public headersrc/libbpfilter/include/bpfilter/data_structures/hashset.h:26 — Still present but related to the duplicated slot-liveness suggestion above.
  • Redundant bf_hashset_contains check in bf_set_add_elemsrc/libbpfilter/set.c:423bf_hashset_add already checks for duplicates internally (returns -EEXIST). The pre-check avoids a malloc but adds redundancy.
  • Commit 4 scope mismatch — Commit lib: set: add bf_set_foreach and bf_set_size modifies files under src/bfcli/, src/bpfilter/cgen/, and tests/ but uses the lib: component prefix.

@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 314af18 to 02938a3 Compare March 12, 2026 23:56
return 0;
return bf_hashset_is_empty(&set->elems);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Memory leak in bf_set_add_elem when bf_hashset_add fails during grow

TAKE_PTR(_elem) nullifies _elem before bf_hashset_add executes. If bf_hashset_add fails (e.g., -ENOMEM from _bf_hashset_grow), the allocated copy is lost — _elem is already NULL so _cleanup_free_ won't free it, and bf_hashset_add didn't store it.

Fix by separating the ownership transfer:

Suggested change
int r = bf_hashset_add(&set->elems, _elem);
if (r)
return r;
TAKE_PTR(_elem);
return 0;

_free_bf_vector_ struct bf_vector *_vec = NULL;

assert(vec);
assert(elem_size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: assert() on non-pointer value

assert(elem_size) validates a size_t, but the style guide states: "Only use assert() for pointer values. For other validation, use if () with appropriate error logging and return codes."

All existing assert() calls in src/libbpfilter/ validate pointers. Consider:

Suggested change
assert(elem_size);
if (!elem_size)
return -EINVAL;

*(void **)bf_vector_get(&(set)->slots, _bf_hset_idx)); \
_bf_hset_brk; _bf_hset_brk = 0)

/**

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Duplicated slot-liveness check between header and .c file

This _bf_hashset_slot_is_occupied function is functionally identical to _bf_slot_is_live in hashset.c (both check slot != NULL && slot != tombstone). The header version uses the magic (void *)1 while the .c version uses _BF_HASHSET_TOMBSTONE.

Consider consolidating into a single static inline in the header and using it from both the macro and the .c file, avoiding the duplicated magic constant.

Copy link
Contributor

@qdeslandes qdeslandes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, first pass of review and a few things to fix. I'll do a second, deeper pass, when those are solved. That being said, it's a very welcome addition! :D

/**
* @brief Compute a FNV-1a 64-bit hash.
*
* Pass @ref BF_FNV1A_INIT as @p hash for the initial call. To hash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ref should be replaced with backticks.

Comment on lines +201 to +203
assert(data);

const uint8_t *bytes = data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const uint8_t *bytes = data;

assert(data);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fan of bpfilter/data_structures/... path. As we're moving to daemon-less, I would suggest bpfilter/core/..., which will later on contain other core structures of the project (e.g. bf_chain).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check dynbuf.c, it's the same logic but dynbuf uses elem_size = 1. I would suggest implementing bf_vector_take to take ownership of the data (so bf_vector would re-allocate memory on the next add). Then (in a subsequent commit), replace bf_dynbuf with bf_vector.

Could you also add a @todo in bf_program to replace img with a bf_vector?

_free_bf_vector_ struct bf_vector *_vec = NULL;

assert(vec);
assert(elem_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only assert pointers for NULL. elem_size should be checked and return an error on 0.

Comment on lines +73 to +76
static inline bool _bf_hashset_slot_is_occupied(void *slot)
{
return slot != NULL && slot != (void *)1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No private functions in headers.

* @param set Pointer to the hashset. Must be non-NULL.
* @param elem_var Name of the @c void* variable to hold each element.
*/
#define bf_hashset_foreach(set, elem_var) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you leverage bf_vector_foreach in here to make it simpler?


#include "test.h"

static uint64_t uint32_hash(const void *data, void *ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static symbols are prefixed with _bf_

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be part of the public headers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth having a bf_hashset_take.

@qdeslandes qdeslandes mentioned this pull request Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants