Real dataset + low error bound = Segmentation Fault #1

@pohchaichon

Description

@pohchaichon

I cannot bulk load (train()) with the error bound used in the paper (32): it produces a segmentation fault or, in some cases, an assertion error. In fact, for my 'harder' datasets I need to push the error bound up to 1024 or much higher before training runs without those errors. On the other hand, an error bound of 32 works fine on a synthetic lognormal dataset, for example. Could you please help me fix the issue, or point out what I am doing wrong?

For your reference, please try the SOSD datasets (https://github.com/learnedsystems/SOSD); in particular, we tested 200M keys of osm: https://www.dropbox.com/s/j1d4ufn4fyb4po2/osm_cellids_800M_uint64.zst?dl=1

I adapted your code to work with uint64_t keys by changing key_type in function.h and adding a generic version of binary_search_branchless() in util.h. In case more int64_t values are hardcoded elsewhere, I also tried converting the keys to signed format, but got the same result.
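For concreteness, the generic binary_search_branchless() I added is along these lines (a sketch: the exact signature and semantics in util.h may differ; this version is a branchless lower_bound returning the index of the first element >= key, and assumes n >= 1):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Generic branchless lower_bound over a sorted array.
// Returns the index of the first element >= key (n if no such element).
// Assumes n >= 1 and data is sorted ascending.
template <typename KeyType>
size_t binary_search_branchless(const KeyType* data, size_t n, KeyType key) {
    const KeyType* base = data;
    while (n > 1) {
        size_t half = n / 2;
        // The ternary compiles to a conditional move rather than a branch:
        // advance past the lower half if its last element is below key.
        base = (base[half - 1] < key) ? base + half : base;
        n -= half;
    }
    return static_cast<size_t>(base - data) + (*base < key);
}
```

With this template in place, both uint64_t and int64_t key types go through the same code path.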
