Skip to content

lib: add bf_hashset#460

Closed
pzmarzly wants to merge 5 commits intofacebook:mainfrom
pzmarzly:push-kputxzpqwntl
Closed

lib: add bf_hashset#460
pzmarzly wants to merge 5 commits intofacebook:mainfrom
pzmarzly:push-kputxzpqwntl

Conversation

@pzmarzly
Copy link
Contributor

@pzmarzly pzmarzly commented Mar 6, 2026

Currently, bf_set uses bf_list under the hood. This is suboptimal performance-wise. Here I am adding bf_hashset, with the exact same API as bf_set. If cmake -DUSE_HASHSET=1 is passed, it will be
used.

  • lib: add bf_vectorbf_hashset is backed by a growable bf_vector, with overflow-safe capacity doubling.
  • lib: helper: add FNV-1a hash function — imported from another PR, a simple hash function.
  • lib: add bf_set_foreach and bf_set_size — currently we directly use the inner bf_list in many places. This is a leak of implementation details, so let's make proper getters.
  • lib: add bf_hashset — main commit. Implements a simple hashset (tombstones on deletion, grows 2x when load factor exceeds 70%, doesn't shrink, linear probing on collision). Includes bf_hashset_foreach,
    add_many/remove_many for bulk operations, and serialization support.
  • build: add USE_HASHSET option to switch to hashset implementation — adds CMake flag to build with hashset.c in place of set.c. Changes GH Actions workflow to test both set.c and hashset.c.

I'm keeping USE_HASHSET default as 0 for now.

@pzmarzly pzmarzly requested a review from qdeslandes as a code owner March 6, 2026 18:04
@meta-cla meta-cla bot added the cla signed label Mar 6, 2026
@pzmarzly pzmarzly marked this pull request as draft March 6, 2026 18:04
@pzmarzly
Copy link
Contributor Author

pzmarzly commented Mar 6, 2026

Fixes #418

@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from 28cbdfa to 57a46e9 Compare March 6, 2026 18:07
@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude: review of facebook/bpfilter #460 (61d3779)

Must fix

  • break in bf_set_foreach / bf_hashset_foreach does not exit iterationsrc/libbpfilter/include/bpfilter/set.h / hashset.h — Fixed: both macros now use a break-tracking variable (_bf_set_brk / _bf_hset_brk).
  • bf_hashset_foreach hardcodes magic number for slot statussrc/libbpfilter/include/bpfilter/hashset.h — Fixed: _BF_HASHSET_SLOT_OCCUPIED now defined as a named constant in the header.
  • bf_set_foreach, bf_hashset_foreach, bf_vector_foreach missing from .clang-format ForEachMacros.clang-format — Fixed: all three macros now listed.
  • Fuzzer artifact upload loses always() guard.github/workflows/ci.yaml:129 — Fixed: now uses if: always() && matrix.use_hashset == 0.
  • bpfilter.pc.in does not propagate BF_USE_HASHSET to pkg-config consumerssrc/libbpfilter/bpfilter.pc.in:9 — Fixed: .pc.in now uses @BF_PC_EXTRA_CFLAGS@ and CMakeLists sets -DBF_USE_HASHSET when enabled.

Suggestions

  • Integer overflow in bf_vector_resize size calculationsrc/libbpfilter/vector.c — Fixed: now uses __builtin_mul_overflow.
  • Overflow risk and VLA on stack in _bf_hashset_growsrc/libbpfilter/hashset.c — Fixed: uses heap allocation (calloc) and SIZE_MAX / 2 overflow guard.
  • bf_hashset_foreach dangling-else interactionsrc/libbpfilter/include/bpfilter/hashset.h — Fixed: the if/else structure in the new macro fully matches internally.
  • Full set.h include widens header couplingsrc/libbpfilter/include/bpfilter/bpfilter.h, chain.h — Forward declarations replaced with #include <bpfilter/set.h>, pulling in transitive includes. Necessary for the BF_USE_HASHSET macro approach; consider conditionally including only when hashset is enabled and keeping the forward declaration otherwise.
  • Missing Doxygen for several bf_hashset_* functionssrc/libbpfilter/include/bpfilter/hashset.h — Fixed: all public functions now have Doxygen.
  • Commit 5 message not imperative mood — Fixed: now reads build: add USE_HASHSET option to switch to hashset implementation.
  • Commits 1, 3, 4, 5 lack commit descriptions — Fixed: all commits now have descriptive bodies.
  • bf_hashset_add return value discarded during rehashsrc/libbpfilter/hashset.c — Fixed: return value now checked, error propagated, and old state restored on failure.
  • Duplicate slot-status constant without sync guardsrc/libbpfilter/hashset.c — Fixed: static_assert(_BF_SLOT_OCCUPIED == _BF_HASHSET_SLOT_OCCUPIED, ...) now keeps them in sync.
  • _bf_hashset_grow rehash failure leaves set in corrupt statesrc/libbpfilter/hashset.c — Fixed: old_len and old_n_used are saved and restored on rehash failure.
  • Cap doubling overflow guard missing in bf_vector_addsrc/libbpfilter/vector.c:85 — Fixed: vec->cap > SIZE_MAX / 2 check added before doubling.
  • Missing dedicated unit tests for core hashset operationstests/unit/libbpfilter/hashset.c — Fixed: dedicated tests now exist for bf_hashset_contains, bf_hashset_remove, bf_hashset_is_empty, bf_hashset_cap, bf_hashset_foreach, including tombstone and duplicate scenarios.
  • _bf_hashset_grow populates new table with repeated vector doublingssrc/libbpfilter/hashset.c:95 — Fixed: now uses bf_vector_resize + memset with one allocation.
  • Missing Doxygen on _bf_hashset_parse_key and _bf_hashset_cmp_key_formatsrc/libbpfilter/hashset.c — Fixed: both static functions now have Doxygen documentation.
  • CI build job does not test USE_HASHSET=1 across distributions.github/workflows/ci.yaml — The build job validates compilation across all distributions but only with USE_HASHSET=0. The hashset path is only compiled inside the test job on Fedora 43. Distribution-specific build failures under USE_HASHSET=1 would be missed.
  • Direct manipulation of bf_vector internals in _bf_hashset_growsrc/libbpfilter/hashset.c — Fixed: now uses bf_vector_data() and bf_vector_set_len() instead of direct field access.
  • No test exercises the grow/rehash pathtests/unit/libbpfilter/hashset.c — Fixed: add_triggers_grow test inserts 20 elements, triggering grow past the initial capacity of 16.
  • No test for bf_hashset_foreach break semanticstests/unit/libbpfilter/hashset.c — Fixed: foreach_break test breaks after first iteration and asserts exactly one iteration.
  • Unbounded probe loop in bf_hashset_add_elemsrc/libbpfilter/hashset.c — Fixed: now uses a bounded for loop with i < _bf_n_slots(set), consistent with _bf_hashset_find.
  • Commit bodies describe 'what' not 'why' — Commits 1 and 2 bodies describe what the new code is rather than why it is needed. The style guide says "Use the commit description to explain why the change is necessary." E.g. commit 1 could mention that bf_vector provides backing storage for the upcoming bf_hashset.
  • _bf_hashset_grow performs redundant work during rehashsrc/libbpfilter/hashset.c — Fixed: _bf_hashset_insert_unchecked helper now skips duplicate-check and load-factor evaluation during rehash.
  • hashset.h pulls in heavyweight transitive includessrc/libbpfilter/include/bpfilter/hashset.h:12-14dump.h, matcher.h, pack.h are only needed for function declarations, not for the struct or bf_hashset_foreach. Forward-declare and move includes to hashset.c.
  • hashset.c unconditionally compiled when USE_HASHSET=0src/libbpfilter/CMakeLists.txt:67set.c is conditionally excluded when USE_HASHSET=1, but hashset.c remains unconditionally compiled. In the default build, both are compiled and linked, adding ~730 lines of unused code. Consider guarding hashset.c symmetrically: if (USE_HASHSET) list(APPEND ... hashset.c) endif().

Nits

  • Doxygen comments missing @brief tag — Fixed.
  • Line exceeds 80 characters — Fixed.
  • Indented preprocessor directives — Fixed.
  • bf_set_contains mapping has no non-hashset counterpart — Fixed: bf_set_contains now implemented in set.c.
  • bf_vector_clean doc suggests incorrect reuse path — Fixed: now references bf_vector_default.
  • BF_FNV1A_INIT and BF_FNV1A_PRIME lack Doxygen annotations — Fixed: now use /// Doxygen comments.
  • Log messages in bf_hashset_new_from_pack reference bf_set — Fixed: pack-error labels now use bf_hashset.*.
  • bft_hashset_eq returns false when both names are NULL — Fixed: now uses lhs->name != rhs->name && (!lhs->name || !rhs->name || strcmp(...)) which correctly handles both-NULL.
  • bf_set_foreach placed before struct bf_set definition — Fixed: now placed after struct bf_set.
  • Inconsistent CMake condition stylesrc/libbpfilter/CMakeLists.txt:67if (NOT USE_HASHSET) uses bare variable name while the root CMakeLists.txt convention is ${VAR} form (e.g., if (NOT ${NO_DOCS})). Same in tests/unit/CMakeLists.txt.
  • static_assert inside function body — Fixed: both static_asserts now at file scope.
  • BF_HASHSET_MAX_N_COMPS missing doc comment — Fixed: now has /// doc comment.
  • Commit 3 scope extends beyond 'lib' component — Commit 3 is titled lib: add bf_set_foreach and bf_set_size but also modifies src/bfcli/print.c (cli), src/bpfilter/cgen/prog/map.c and src/bpfilter/cgen/program.c (daemon), and test files. Consider splitting or adjusting the title.
  • bf_hashset_new_from_pack uses implicit i from bf_rpack_array_foreach — Fixed: now uses key[n_comps - 1].
  • Direct struct field access bypasses accessor patternsrc/bpfilter/cgen/program.c, src/bpfilter/cgen/prog/map.cset->elem_size and set->use_trie are accessed directly while the PR introduces bf_set_get_* accessors elsewhere. Consider adding bf_set_get_elem_size / bf_set_get_use_trie for completeness.
  • bf_hashset_foreach uses function calls in loop conditionsrc/libbpfilter/include/bpfilter/hashset.h:233-245 — Calls bf_hashset_cap(set) on every outer-loop iteration and bf_vector_get(...) twice per slot. bf_list_foreach and bf_vector_foreach access struct fields directly. Consider using (set)->slots.len and (set)->slots.data for consistency.
  • Bare #endif without identifying commentsrc/libbpfilter/include/bpfilter/set.h — The #ifdef BF_USE_HASHSET / #else / #endif block spans the entire file. Adding #endif /* BF_USE_HASHSET */ would improve navigability.

CLAUDE.md improvements

  • assert() on non-pointer values (e.g., assert(elem_size) in vector.c:22, assert(len <= vec->cap) in vector.c:141) appears in 3+ existing places in the codebase (helper.c, set.c, bpf.c, matcher.c). The style guide says to use assert() only for pointer values — consider clarifying whether this is enforced for new code.

@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from 57a46e9 to 6b64c48 Compare March 9, 2026 15:24
@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from 6b64c48 to 783f5a3 Compare March 9, 2026 16:24
@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from 783f5a3 to c480a27 Compare March 9, 2026 17:07
@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from c480a27 to 516ba3c Compare March 9, 2026 17:40
@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from 516ba3c to 3821d0d Compare March 9, 2026 18:20
@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from 3821d0d to 72eab46 Compare March 10, 2026 12:23
@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch 2 times, most recently from 5f6eab6 to a49465f Compare March 10, 2026 14:24
@yaakov-stein
Copy link
Contributor

yaakov-stein commented Mar 10, 2026

Thanks for working on this! I have some architectural thoughts that I'd be interested to hear your take on.

Right now, in libbpfilter, the idea of bf_list and bf_set were (at least to me) a tad bit confusing initially. Specifically:

  • bf_list: This is a generic data structure that represents a linked list
  • bf_set: This is a representation of a bpfilter set, which is used to construct the BPF map that will function as a true set. This is not a data structure really, it's just a representation of a bpfilter concept with the name of a data structure.

Why does this matter?

  1. I think the addition of bf_hashset should be added as a generic data structure, as opposed to taking the place of bf_set.
  2. I think it'd be helpful to have a subfolder or something where data structures are located, as opposed to with all of the bpfilter representations. I'm not sure if this is necessary now, but it may be helpful as more are added.

My understanding of the current approach is that bf_hashset is a drop-in replacement for bf_set, but this blurs the line further between generic data structures and representations of a bpfilter concept. This also means the bf_hashset isn't generic in this approach and can't be used for other future use cases, as someone contributing or using the repo might imagine.

My opinion would be to add bf_hashset (and bf_vector) as generic data structures, and then have bf_set use bf_hashset internally for storage, rather than becoming bf_hashset via macro swap. This keeps the domain concept (bf_set - a set with matcher type keys) separate from the data structure (bf_hashset).

Two other (possibly connected) smaller points:

  1. The current behavior (pre-hashset) is that you get the same ordering for a set each time you print it out or make a call to get a chain. With this change, we'd lose that ordering. As a user, this may be annoying (not 100% sure though). While I haven't thoroughly thought this through, I think we can store insertion order data optionally in the hashset data structure and offer an API that allows us to get the ordered data (every operation would be O(n) of course to get ordered data).
  2. Is the build flag necessary? Besides for point 1, why would a user have a preference to not use a hashset?

@pzmarzly @qdeslandes Interested to hear your thoughts!

@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from a49465f to b07b9dc Compare March 11, 2026 14:53
pzmarzly and others added 5 commits March 12, 2026 13:23
Dynamically-sized array of fixed-size elements.
Elements are stored inline in a single contiguous allocation.
The vector doubles in capacity on overflow.
A simple data hashing function.
Callers previously accessed the inner bf_list directly. Provide proper
iteration and size accessors so the set implementation can be swapped
without touching every call site.
Open-addressing hashset with linear probing backed by bf_vector.
Elements are fixed-size blobs hashed with FNV-1a and compared with
memcmp. Removed elements leave tombstones; the table grows at 70% load
but never shrinks.

This provides an O(1) average-case alternative to the linked-list-based
bf_set for large element counts.
When cmake -DUSE_HASHSET=1 is passed, set.h redirects all bf_set_*
symbols to their bf_hashset_* counterparts via preprocessor macros.
The CI workflow is extended to test both paths. USE_HASHSET defaults
to off for now.
@pzmarzly
Copy link
Contributor Author

@yaakov-stein i like your suggestion a lot. Initially I thought bf_set will have most code dedicated to storing data, and some code related to {name,key_format,use_trie}, but when I started implementing it, I realized it's the other way around. But by that point I so far in so I didn't reconsider splitting it like you suggested.

I'll try splitting the "low-level" code into data_structures directory, and then just switch set.c from bf_list to bf_hashset. Though I think I'll make a separate branch/PR for trying out that approach.

Is the build flag necessary?

Probably not. I thought that this way I will surely not block the daemonless work in any way, and that we can flip the flag once @qdeslandes's work is complete, but now I see that it may actually cause more merge conflicts for little gain.

@pzmarzly pzmarzly force-pushed the push-kputxzpqwntl branch from b07b9dc to 61d3779 Compare March 12, 2026 14:31
Comment on lines +233 to +245
* @brief Iterate over all occupied elements in a hashset.
*
* @param set Pointer to the hashset to iterate over. Must be non-NULL.
* @param elem_var Name of the void* variable to hold each element.
*/
#define bf_hashset_foreach(set, elem_var) \
for (size_t _bf_hset_idx = 0, _bf_hset_brk = 0; \
_bf_hset_idx < bf_hashset_cap(set) && !_bf_hset_brk; ++_bf_hset_idx) \
if (*(uint8_t *)bf_vector_get(&(set)->slots, _bf_hset_idx) != \
_BF_HASHSET_SLOT_OCCUPIED) \
continue; \
else \
for (void *(elem_var) = \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: bf_hashset_foreach uses function calls in loop condition unlike other foreach macros (suggestion)

The macro calls bf_hashset_cap(set) on every outer-loop iteration and bf_vector_get(...) twice per occupied slot (once for the status check, once for the element pointer). By contrast, bf_list_foreach and bf_vector_foreach access struct fields directly (e.g., (list)->head, (vec)->data, (vec)->len) without function-call overhead.

Consider accessing (set)->slots fields directly for consistency:

  • Use (set)->slots.len instead of bf_hashset_cap(set)
  • Compute slot pointers from (set)->slots.data instead of calling bf_vector_get()

@qdeslandes
Copy link
Contributor

Is this still required as we have #470 ?

@pzmarzly
Copy link
Contributor Author

Closing, #470 is better

@pzmarzly pzmarzly closed this Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants