
Conversation

@anderkve
Collaborator

@anderkve anderkve commented Feb 9, 2022

As discussed in the NeutrinoBit meeting today, we sometimes need to use fairly large datasets. (The current example is that Super-K provides their 4D tabulated chi^2 function in two text files with a total size of 150 MB.) This issue will only become more frequent going forward -- e.g. we probably don't want to include every ATLAS FullLikelihood .json file directly in our repo, as each of these files is a few MB.

So I think it would be good to have a cmake system for downloading these types of datasets.

In the NeutrinoBit meeting we briefly discussed how we could probably just use a "fake" backend (e.g. in the Super-K case just a BE convenience function that performs interpolation) to effectively get a downloadable dataset in the current cmake system. But I think it will be easier and less confusing to have a separate part of our cmake system properly dedicated to downloading datasets that in reality aren't connected with any backend. Typically, these are the datasets that we would put in SomeBit/data/.

This PR is a suggestion for such a cmake system. It's essentially just a new file datasets.cmake where we can register downloadable datasets, much like how we register backends in backends.cmake.

The current dummy example in datasets.cmake downloads our own CMSSM/NUHM best-fit SLHA files as a tarball from Zenodo and puts them in ExampleBit_A/data/best_fits_SLHA_1705_07935.

The generated make targets are make dataset-best_fits_SLHA_1705_07935, make nuke-dataset-best_fits_SLHA_1705_07935 and make nuke-datasets. The target make dataset-best_fits_SLHA_1705_07935 is not added if ExampleBit_A is ditched.
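For concreteness, a registration entry in datasets.cmake might look something like the following. The macro name register_dataset and all of its arguments are hypothetical, meant only to illustrate the registration pattern described above; the actual names and URL in this PR may differ.

```cmake
# Hypothetical sketch of a dataset registration in cmake/datasets.cmake.
# The register_dataset macro, its keyword arguments, and the URL below are
# illustrative only; they are not the actual interface from this PR.
register_dataset(
  NAME   best_fits_SLHA_1705_07935      # -> make dataset-best_fits_SLHA_1705_07935
  MODULE ExampleBit_A                   # target skipped if ExampleBit_A is ditched
  URL    "https://zenodo.org/record/<id>/files/best_fits.tar.gz"
  MD5    "<checksum>"                   # verify the download before unpacking
  DEST   "ExampleBit_A/data/best_fits_SLHA_1705_07935"
)
```

Internally, such a macro would presumably wrap add_custom_target calls for the dataset-<name> and nuke-dataset-<name> targets, reusing the download logic in cmake/scripts/safe_dl.sh.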

What do you think, @tegonzalo? (Added you as reviewer.) Also tagging @patscott for your thoughts on this.

…GAMBIT modules

	modified:   CMakeLists.txt
	modified:   cmake/cleaning.cmake
	new file:   cmake/datasets.cmake
	modified:   cmake/scripts/safe_dl.sh
@anderkve anderkve requested a review from tegonzalo February 9, 2022 01:55
@patscott
Member

This sounds like a good idea. A few thoughts spring to mind:

  • We have a couple of dummy backends for data already: plc_data and higgsbounds_tables. It would be worth checking how these would go in the new system.
  • One of the best things about these existing dummy backends is that it is easy to declare dependencies of other backends on them. It would be worth checking that that still works as easily in the new system.
  • I'd suggest getting the non-dataset-specific code out of datasets.cmake. Maybe put it in externals.cmake or something like that. This is both for clarity and for symmetry with scanners.cmake and backends.cmake.

@anderkve
Collaborator Author

  • We have a couple of dummy backends for data already: plc_data and higgsbounds_tables. It would be worth checking how these would go in the new system.
  • One of the best things about these existing dummy backends is that it is easy to declare dependencies of other backends on them. It would be worth checking that that still works as easily in the new system.

For now I'd be inclined not to incorporate these dummy backends into the new system at all. Since they represent data connected to a specific backend, I think they quite naturally belong within the backends.cmake framework, as something that eventually goes into the Backends directory. The idea for the new system was to cover data files that aren't part of a backend, but rather are used directly by a GAMBIT module (and thus should live in SomeBit/data). But I don't feel strongly about this -- if you think it's much neater to have all dataset downloads in datasets.cmake I'd be happy with that.

It's an open question whether we should make the cmake target for SomeBit depend on its dataset targets, but I think it's probably better not to. I don't think the user should have to download a potentially large dataset just to compile SomeBit, since they may not be interested in using the module function that requires the data. That means, however, that it would be up to the module function using the dataset to first check that it exists, and throw a sensible error (with the suggested make command) if it's not present.

One perhaps useful thing would be to somehow register each dataset in datasets.cmake with its corresponding GAMBIT module, so that when running cmake we could output a message saying which make dataset-<name> commands should be run to get all the datasets for the non-ditched modules.
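A minimal sketch of how such a module association and configure-time message could work, assuming a hypothetical helper function and assuming GAMBIT_BITS holds the list of non-ditched modules (neither is part of this PR):

```cmake
# Hypothetical: collect dataset names per module in a global property,
# then report the relevant make commands at configure time.
function(associate_dataset_with_module dataset module)
  set_property(GLOBAL APPEND PROPERTY DATASETS_FOR_${module} ${dataset})
endfunction()

# After all registrations, loop over the non-ditched modules (here assumed
# to be listed in GAMBIT_BITS) and print the suggested commands.
foreach(module ${GAMBIT_BITS})
  get_property(ds GLOBAL PROPERTY DATASETS_FOR_${module})
  foreach(d ${ds})
    message(STATUS "Module ${module}: run 'make dataset-${d}' to fetch its data.")
  endforeach()
endforeach()
```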

I'd suggest getting the non-dataset-specific code out of datasets.cmake. Maybe put it in externals.cmake or something like that. This is both for clarity and for symmetry with scanners.cmake and backends.cmake.

Good idea. Will do that.

@patscott
Member

I agree that it's better not to make the Bits all depend on all datasets that they might not use, but I don't see a good argument for maintaining two lots of dataset targets, one associated with backends and one not. I think it would be a lot neater to just make datasets associated with backends part of datasets.cmake, and allow one to declare that any cmake target depends on any dataset cmake target if desired. Then the backends that would not work without their data can declare that they depend on a dataset, and if a module can work OK without dataset X, it just doesn't declare a dependency.

You'd then just have some other custom function/macro that you could put in datasets.cmake that allowed one to declare that modules X, Y and Z might try to make use of the dataset (for the purposes of your suggested message), independent of any actual dependencies.
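In cmake terms, the per-target dependency declaration described above could be as simple as the following (the target names are illustrative, not from this PR):

```cmake
# Hypothetical: a backend that cannot work without its data declares a hard
# dependency on the dataset target, so building the backend triggers the
# dataset download.
add_dependencies(higgsbounds dataset-higgsbounds_tables)

# A module that can run fine without dataset X simply declares nothing, and
# the user fetches the data on demand with:
#   make dataset-X
```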

@anderkve
Collaborator Author

OK, thanks, sounds like a good plan. I'll go with that.

@tegonzalo
Collaborator

Hi, sorry for the late reply. I agree with most of what you guys discussed. My main suggestion is not to associate datasets with the modules themselves, but with the module functions that need them. We can create a DATASET_REQ à la BACKEND_REQ, so that the dependency resolver will complain if you haven't "built" the dataset.

@patscott patscott mentioned this pull request Jun 14, 2022
@anderkve anderkve added the WIP work in progress label Dec 13, 2023
@anderkve
Collaborator Author

anderkve commented Apr 1, 2025

Comment from Core meeting: I will get back to this once I'm done with PR #485. We will probably keep the first version of this system simple, i.e. working only at the level of the cmake targets (backends and modules). Introducing a more fine-grained system, at the level of dependency resolution, can be future work. As a first concrete example I will add the DMsimp_data from ColliderBit.
