
Conversation

@anderkve
Collaborator

@anderkve anderkve commented Feb 9, 2022

As discussed in the NeutrinoBit meeting today, we sometimes need to use fairly large datasets. (The current example is that Super-K provides their 4D tabulated chi^2 function in two text files with a total size of 150 MB.) This issue will only become more frequent going forward -- e.g. we probably don't want to include every ATLAS FullLikelihood .json file directly in our repo, as each of these files is a few MB.

So I think it would be good to have a cmake system for downloading these types of datasets.

In the NeutrinoBit meeting we briefly discussed how we could probably just use a "fake" backend (e.g. in the Super-K case just a BE convenience function that performs interpolation) to effectively get a downloadable dataset in the current cmake system. But I think it will be easier and less confusing to have a separate part of our cmake system properly dedicated to downloading datasets that in reality aren't connected with any backend. Typically, these are the datasets that we would put in SomeBit/data/.

This PR is a suggestion for such a cmake system. It's essentially just a new file datasets.cmake where we can register downloadable datasets, much like how we register backends in backends.cmake.

The current dummy example in datasets.cmake downloads our own CMSSM/NUHM best-fit SLHA files as a tarball from Zenodo and puts them in ExampleBit_A/data/best_fits_SLHA_1705_07935.

The generated make targets are make dataset-best_fits_SLHA_1705_07935, make nuke-dataset-best_fits_SLHA_1705_07935 and make nuke-datasets. The target make dataset-best_fits_SLHA_1705_07935 is not added if ExampleBit_A is ditched.
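For concreteness, a registration entry in datasets.cmake might look something like the following. The macro name register_dataset and all of its arguments are hypothetical, meant only to illustrate the registration pattern described above; the actual names and URL in this PR may differ.

```cmake
# Hypothetical sketch of a dataset registration in cmake/datasets.cmake.
# The register_dataset macro, its keyword arguments, and the URL below are
# illustrative only; they are not the actual interface from this PR.
register_dataset(
  NAME   best_fits_SLHA_1705_07935      # -> make dataset-best_fits_SLHA_1705_07935
  MODULE ExampleBit_A                   # target skipped if ExampleBit_A is ditched
  URL    "https://zenodo.org/record/<id>/files/best_fits.tar.gz"
  MD5    "<checksum>"                   # verify the download before unpacking
  DEST   "ExampleBit_A/data/best_fits_SLHA_1705_07935"
)
```

Internally, such a macro would presumably wrap add_custom_target calls for the dataset-<name> and nuke-dataset-<name> targets, reusing the download logic in cmake/scripts/safe_dl.sh.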

What do you think, @tegonzalo? (Added you as reviewer.) Also tagging @patscott for your thoughts on this.

…GAMBIT modules

	modified:   CMakeLists.txt
	modified:   cmake/cleaning.cmake
	new file:   cmake/datasets.cmake
	modified:   cmake/scripts/safe_dl.sh
@anderkve anderkve requested a review from tegonzalo February 9, 2022 01:55
@patscott
Member

This sounds like a good idea. A few thoughts spring to mind:

  • We have a couple of dummy backends for data already: plc_data and higgsbounds_tables. It would be worth checking how these would go in the new system.
  • One of the best things about these existing dummy backends is that it is easy to declare dependencies of other backends on them. It would be worth checking that that still works as easily in the new system.
  • I'd suggest getting the non-dataset-specific code out of datasets.cmake. Maybe put it in externals.cmake or something like that. This is both for clarity and for symmetry with scanners.cmake and backends.cmake.

@anderkve
Collaborator Author

  • We have a couple of dummy backends for data already: plc_data and higgsbounds_tables. It would be worth checking how these would go in the new system.
  • One of the best things about these existing dummy backends is that it is easy to declare dependencies of other backends on them. It would be worth checking that that still works as easily in the new system.

For now I'd be inclined not to incorporate these dummy backends into the new system at all. Since they represent data connected to a specific backend, I think they quite naturally belong within the backends.cmake framework, as something that eventually goes into the Backends directory. The idea for the new system was to cover data files that aren't part of a backend, but rather are used directly by a GAMBIT module (and thus should live in SomeBit/data). But I don't feel strongly about this -- if you think it's much neater to have all dataset downloads in datasets.cmake I'd be happy with that.

It's an open question whether we should make the cmake target for SomeBit depend on its dataset targets, but I think it's probably better not to. I don't think the user should have to download a potentially large dataset just to compile SomeBit, since they may not be interested in using the module function that requires the data. That means, however, that it would be up to the module function using the dataset to first check that it exists, and throw a sensible error (with the suggested make command) if it's not present.

One perhaps useful thing would be to somehow register each dataset in datasets.cmake with its corresponding GAMBIT module, so that when running cmake we could output a message saying which make dataset-<name> commands should be run to get all the datasets for the non-ditched modules.
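A minimal sketch of how such a module association and configure-time message could work, assuming a hypothetical helper function and assuming GAMBIT_BITS holds the list of non-ditched modules (neither is part of this PR):

```cmake
# Hypothetical: collect dataset names per module in a global property,
# then report the relevant make commands at configure time.
function(associate_dataset_with_module dataset module)
  set_property(GLOBAL APPEND PROPERTY DATASETS_FOR_${module} ${dataset})
endfunction()

# After all registrations, loop over the non-ditched modules (here assumed
# to be listed in GAMBIT_BITS) and print the suggested commands.
foreach(module ${GAMBIT_BITS})
  get_property(ds GLOBAL PROPERTY DATASETS_FOR_${module})
  foreach(d ${ds})
    message(STATUS "Module ${module}: run 'make dataset-${d}' to fetch its data.")
  endforeach()
endforeach()
```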

I'd suggest getting the non-dataset-specific code out of datasets.cmake. Maybe put it in externals.cmake or something like that. This is both for clarity and for symmetry with scanners.cmake and backends.cmake.

Good idea. Will do that.

@patscott
Member

I agree that it's better not to make the Bits all depend on all datasets that they might not use, but I don't see a good argument for maintaining two lots of dataset targets, one associated with backends and one not. I think it would be a lot neater to just make datasets associated with backends part of datasets.cmake, and allow one to declare that any cmake target depends on any dataset cmake target if desired. Then the backends that would not work without their data can declare that they depend on a dataset, and if a module can work OK without dataset X, it just doesn't declare a dependency.

You'd then just have some other custom function/macro that you could put in datasets.cmake that allowed one to declare that modules X, Y and Z might try to make use of the dataset (for the purposes of your suggested message), independent of any actual dependencies.
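In cmake terms, the per-target dependency declaration described above could be as simple as the following (the target names are illustrative, not from this PR):

```cmake
# Hypothetical: a backend that cannot work without its data declares a hard
# dependency on the dataset target, so building the backend triggers the
# dataset download.
add_dependencies(higgsbounds dataset-higgsbounds_tables)

# A module that can run fine without dataset X simply declares nothing, and
# the user fetches the data on demand with:
#   make dataset-X
```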

@anderkve
Collaborator Author

OK, thanks, sounds like a good plan. I'll go with that.

@tegonzalo
Collaborator

Hi, sorry for the late reply. I agree with most of what you guys discussed. My main suggestion is not to associate datasets with the modules themselves, but with the module functions that need them. We can create a DATASET_REQ à la BACKEND_REQ, so that the dependency resolver will complain if you haven't "built" the dataset.

@patscott patscott mentioned this pull request Jun 14, 2022
@anderkve anderkve added the WIP work in progress label Dec 13, 2023
@anderkve
Collaborator Author

anderkve commented Apr 1, 2025

Comment from Core meeting: I will get back to this once I'm done with PR #485. We will probably keep the first version of this system simple, i.e. working only at the level of the cmake targets (backends and modules). Introducing a more fine-grained system, at the level of dependency resolution, can be future work. As a first concrete example I will add the DMsimp_data from ColliderBit.
