Update the dataset workflow with new structure/format #10

@theomeb

Description

The main idea (still to be confirmed) is to offer the user the following workflow:

  • The user adds raw data files (CSV, plus NPY for embeddings; to be extended to other formats as well)
  • The user defines a schema for variable types
  • The library converts the raw data files into a format suitable for loading the data into memory
  • The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader so that models can be trained directly on this dataset (see the sketch after this list)
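
Below is a minimal sketch of what this workflow could look like from the user's side; the `Dataset` class, its constructor arguments, and the schema format are hypothetical placeholders, not a settled API:

```python
from biodatasets import Dataset  # hypothetical API, names TBD

# 1. Point the library at the raw files (CSV + NPY embeddings).
ds = Dataset(
    csv_path="data/proteins.csv",            # hypothetical path
    embeddings_path="data/embeddings.npy",   # hypothetical path
    # 2. Declare variable types so the converter knows how to store them.
    schema={"sequence": str, "label": int, "embedding": "float32[1024]"},
)

# 3. The library converts the raw files into its in-memory format...
ds.build()

# 4. ...and hands back native framework objects for training.
tf_data = ds.to_tf_dataset()          # tf.data.Dataset
torch_loader = ds.to_torch_dataset()  # torch.utils.data.DataLoader
```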

For the last point, there are mainly two options:

  • convert the dataset (CSV with NPY files) to HDF5, then use Apache Arrow or vaex to load it in memory
  • or, if we want native TF/Torch tensors in the end, convert the datasets into Parquet and then use petastorm (see the sketch after this list)
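
A minimal sketch of both loading routes, assuming the raw CSV/NPY files have already been converted offline; the file paths and column name are made up:

```python
# Option 1: HDF5 memory-mapped with vaex (lazy, out-of-core).
import vaex

df = vaex.open("data/dataset.hdf5")   # hypothetical path
embeddings = df["embedding"].values   # hypothetical column name

# Option 2: Parquet read through petastorm, yielding native tensors.
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader("file:///data/dataset.parquet") as reader:
    tf_dataset = make_petastorm_dataset(reader)  # a native tf.data.Dataset

# The same reader can feed PyTorch through petastorm's DataLoader wrapper.
from petastorm.pytorch import DataLoader

with DataLoader(make_batch_reader("file:///data/dataset.parquet"),
                batch_size=32) as loader:
    for batch in loader:
        break  # each batch is a dict of torch tensors
```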

A brainstorm has been captured in a Notion doc. The next step is to properly investigate the different options.

Other points:

  • TensorFlow and PyTorch should not be hard dependencies of the project; we should declare them in environment.yaml rather than in requirements.txt
  • The user needs to be able to use the biodatasets package with either PyTorch or TF installed. We therefore need to guard the imports in both to_torch_dataset() and to_tf_dataset(), catch the ImportError, and tell the user which library must be installed when they call one of these functions (see the sketch below).
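
One possible shape for the guarded imports; the error message wording is only a suggestion:

```python
def to_torch_dataset(self):
    """Return the dataset as a torch.utils.data.DataLoader."""
    try:
        from torch.utils.data import DataLoader  # noqa: F401
    except ImportError as err:
        raise ImportError(
            "to_torch_dataset() requires PyTorch. Install it "
            "(e.g. `pip install torch`) or use to_tf_dataset() instead."
        ) from err
    ...  # build and return the DataLoader


def to_tf_dataset(self):
    """Return the dataset as a tf.data.Dataset."""
    try:
        import tensorflow as tf  # noqa: F401
    except ImportError as err:
        raise ImportError(
            "to_tf_dataset() requires TensorFlow. Install it "
            "(e.g. `pip install tensorflow`) or use to_torch_dataset() instead."
        ) from err
    ...  # build and return the tf.data.Dataset
```

Deferring the imports to call time also means neither library has to appear in requirements.txt, which matches the first point above.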
