Real dataset + low error bound = Segmentation Fault #1

@pohchaichon

Description

@pohchaichon

I cannot bulk load (train()) with the error bound used in the paper (32): it produces a segmentation fault or, in some cases, an assertion error. In fact, for my 'harder' datasets I need to push the error bound up to 1024 or much higher before training runs without those errors. On the other hand, an error bound of 32 works fine on a synthetic lognormal dataset, for example. Could you please help me fix the issue, or point out what I am doing wrong?

For your reference, please try the SOSD datasets (https://github.com/learnedsystems/SOSD); in particular, we tested 200M keys of osm: https://www.dropbox.com/s/j1d4ufn4fyb4po2/osm_cellids_800M_uint64.zst?dl=1

I adapted your code to work with uint64_t keys by changing key_type in function.h and adding a generic version of binary_search_branchless() in util.h. In case more int64_t values are hardcoded elsewhere, I also tried converting the keys to signed format, but got the same result.
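For concreteness, the generic binary_search_branchless() I added is along these lines (a sketch: the exact signature and semantics in util.h may differ; this version is a branchless lower_bound returning the index of the first element >= key, and assumes n >= 1):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Generic branchless lower_bound over a sorted array.
// Returns the index of the first element >= key (n if no such element).
// Assumes n >= 1 and data is sorted ascending.
template <typename KeyType>
size_t binary_search_branchless(const KeyType* data, size_t n, KeyType key) {
    const KeyType* base = data;
    while (n > 1) {
        size_t half = n / 2;
        // The ternary compiles to a conditional move rather than a branch:
        // advance past the lower half if its last element is below key.
        base = (base[half - 1] < key) ? base + half : base;
        n -= half;
    }
    return static_cast<size_t>(base - data) + (*base < key);
}
```

With this template in place, both uint64_t and int64_t key types go through the same code path.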
