Open
Labels
enhancement (New feature or request)
Description
The main idea (still to be confirmed) is to give the user the following workflow:
- The user adds raw data files (currently csv, plus npy for embeddings; other formats to be supported later)
- The user defines a schema for variable types
- The library converts the raw data files into a format used to load data in memory
- The dataset instance can return a native `tf.data.Dataset` or `torch.utils.data.DataLoader` so that models can be trained with this dataset
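As a rough illustration of the schema step above, here is a minimal sketch that uses only the standard library. The `Schema` class, the `load_rows` helper, and the sample columns are all hypothetical and made up for this example; the actual `biodatasets` API is still to be decided.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Schema:
    """Hypothetical schema: maps each CSV column name to a casting function."""
    columns: Dict[str, Callable[[str], object]]


def load_rows(csv_text: str, schema: Schema) -> List[dict]:
    """Parse CSV text and cast every declared column to its schema type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        # Fail early if the file does not contain a column the schema declares.
        missing = set(schema.columns) - set(raw)
        if missing:
            raise KeyError(f"columns missing from CSV: {sorted(missing)}")
        rows.append({name: cast(raw[name]) for name, cast in schema.columns.items()})
    return rows


# Made-up sample data standing in for a raw csv file added by the user.
RAW_CSV = "sequence,length,label\nMKTAY,5,1\nGAVLI,5,0\n"
schema = Schema(columns={"sequence": str, "length": int, "label": int})
rows = load_rows(RAW_CSV, schema)
```

In a real implementation the typed rows would then be written to the chosen in-memory-friendly format rather than kept as a list of dicts.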
For the last point, there are mainly two options:
- convert the dataset (`csv` with `npy` files) to `hdf5`, then use Apache Arrow or vaex to load it in memory
- or, if we want native `tf`/`torch` tensors in the end: convert the dataset to Parquet and then use petastorm
The brainstorming has been done in a Notion doc. The next step is to properly investigate the different options.
Other points:
- TensorFlow and PyTorch should not be dependencies of the project; we need to list them in `environment.yaml` rather than `requirements.txt`
- The user needs to be able to use the `biodatasets` package with either PyTorch or TF installed, so we need to handle import errors in both `to_torch_dataset()` and `to_tf_dataset()`: if the user calls one of these functions without the corresponding library installed, we catch the error and display a message saying that the library should be installed.