The Dataset Modality Generator is a Python-based project designed for running inference with various Large Language Models (LLMs), particularly Vision Language Models (VLMs), to generate new textual modalities for image datasets. The toolkit can be used for tasks such as:
- Detailed Image Description: Generating rich, objective textual descriptions of visual content.
- Description Summarization: Condensing detailed descriptions into concise summaries.
The primary motivation behind creating this generator was to enhance datasets with high-quality textual descriptions. These descriptions can serve as a crucial additional modality for downstream multimodal AI systems, potentially improving their classification accuracy or understanding capabilities. The project leverages the vLLM library for efficient model inference, enabling the use of powerful LLMs even on systems with relatively modest computational resources, including personal laptops, as well as on supercomputing clusters.
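For orientation, the snippet below is a minimal, illustrative sketch of the kind of vLLM call the toolkit wraps. The model ID, prompt format, and image path are taken from examples elsewhere in this README, while the instruction text is made up for illustration; the actual pipeline (`src/pipelines/run_vllm_experiment.py`) adds configuration handling, batching, and metrics collection on top of this.

```python
# Minimal, illustrative sketch of direct vLLM image-to-text inference.
# The project's pipeline adds config handling, batching, and metrics on top of this.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf", dtype="bfloat16", max_model_len=4096)
sampling = SamplingParams(temperature=0.6, max_tokens=250)

image = Image.open("data/plant_doc/images/train/img_1.jpg").convert("RGB")
# Illustrative instruction; the real prompts live in config/prompts.yaml.
prompt = "USER: <image>\nDescribe the visible plant and any signs of disease.\nASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```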
This guide uses the PlantDoc dataset (containing images of various plants with and without diseases) as a primary example to demonstrate the workflow. However, the toolkit is designed to be adaptable to other image datasets. You can integrate your own datasets by:
- Structuring your data similarly to the PlantDoc example (see Data Preparation).
- Adjusting the relevant YAML configuration files.
- Executing the provided scripts locally, or via SLURM on a supercomputer.
This README provides instructions for setting up the environment and running the project, with specific examples tailored for the PLGrid Athena supercomputer. Athena nodes typically feature NVIDIA A100 GPUs (e.g., 8 GPUs per node, each with 40GB VRAM). All SLURM resource allocation parameters (like GPU count, memory, time) can be configured within the shell scripts located in the scripts/ directory.
Below is an example of textual descriptions generated by Vision Language Models for a sample image from the PlantDoc training dataset.
Example Image from PlantDoc dataset:
| Model | Generated Image Description |
|---|---|
| google/gemma-3-27b-it | Several rounded fruits, predominantly red with areas of yellow-green, are visible amongst green leaves. The leaves exhibit varying shades of green, with some showing yellowing. One leaf displays numerous dark, irregularly shaped blotches, transitioning to gray in some areas, covering a significant portion of its surface. These blotches appear slightly raised and have indistinct edges. The leaf margins are serrated. A branch, appearing brownish-gray, supports the fruits and leaves. Small, dark spots are present on the surface of one of the fruits. The leaves have a smooth texture and an ovate shape. |
| OpenGVLab/InternVL3-14B-Instruct | The image shows a close-up of an apple tree with apples and leaves. One of the leaves has visible black spots and discoloration, which could be a sign of a fungal infection. The apples appear healthy but are partially obscured by the leaves. |
| OpenGVLab/InternVL3-38B | The leaf in the image is showing signs of a fungal infection, likely black spot disease, which is common in apple trees. This condition is characterized by dark, circular spots on the leaves. To manage it, ensure proper air circulation around the tree by pruning overcrowded branches, remove and destroy affected leaves, and consider using a fungicide if the infection is severe. Watering at the base of the plant rather than overhead can also help prevent the spread of the fungus. |
| llava-hf/llava-1.5-7b-hf | An image of a tree with many apples on it, including one with spots on it. The tree has a leaf with a blemish on it. The apples are in various stages of ripeness, with some appearing ripe and others not yet showing signs of ripeness. The image captures the beauty of nature and the growth of fruit on a tree. |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | The image shows a close-up of a fruit and several leaves. The fruit is round, predominantly red with some yellowish-green patches, and has a smooth texture. There are small, dark, circular spots on its surface. The leaves are ovate with serrated margins and exhibit a mix of green and yellow colors. One leaf prominently displays numerous dark, irregularly shaped spots with a slightly raised texture, scattered across its surface. The leaf veins are visible, and the overall leaf texture appears slightly wrinkled. The background includes additional fruits and leaves, all appearing similarly affected. |
| microsoft/Phi-3.5-vision-instruct | The image presents a close-up view of a single apple hanging from a tree branch, surrounded by green leaves. The apple has a reddish-pink hue with a few dark spots, indicating possible signs of disease or natural variation. The leaves exhibit a pattern of dark spots and discoloration, which could be a sign of a plant disease or pest infestation. The overall health of the apple and the leaves suggests that the tree may be suffering from a condition that could affect fruit quality and yield. |
| Qwen/Qwen2.5-VL-32B-Instruct | The image shows a close-up of a plant with visible leaves, stems, and fruits. The leaves are broad and green with some exhibiting irregular, dark brown to black lesions scattered across the surface, concentrated mostly on the upper side. These lesions vary in size and are somewhat circular or irregular in shape. The leaf texture appears smooth, and the margins are serrated. The stems are thin and brown, with visible nodes. The fruits are round, glossy, and predominantly red, with one fruit showing a noticeable dark, sunken spot near its surface. The background includes other green leaves and parts of the plant structure, indicating a natural outdoor setting. The overall appearance suggests some form of surface marking or blemish affecting both the leaves and fruits. |
| HuggingFaceTB/SmolVLM2-2.2B-Instruct | In the image, a close-up view of a red apple reveals a mottled pattern on its surface. The apple is attached to a green leaf, which exhibits significant spotting, predominantly in shades of black and green. The leaf and the apple share a common brown spot on one of their surfaces, indicative of an infected area. The background is a blur of additional green leaves, suggesting the apple is part of a larger tree. The overall appearance of the plant and its fruit, along with the specific spotting pattern, are the only visible aspects that can be described without making any assumptions or diagnoses. |
Here's an overview of the project's directory structure:
├── .env
├── loggers.py
├── main.py
├── Readme.md
├── requirements.txt
├── slurm_job.conf
├── config/
│ ├── current_vllm_experiment.yaml
│ ├── paths_config.yaml
│ ├── prompts.yaml
│ ├── vllm_experiment_plans.yaml
│ └── vllm_models_and_tests.yaml
├── config/dataset_specific_params/
│ └── plant_doc.yaml
├── data/
│ └── plant_doc/
│ ├── metadata.csv
│ ├── images/
│ │ ├── test/
│ │ │ ├── img_1.jpg
│ │ │ ├── img_2.jpg
│ │ │ ├── ...
│ │ │ └── img_n.jpg
│ │ └── train/
│ │ │ ├── img_1.jpg
│ │ │ ├── img_2.jpg
│ │ │ ├── ...
│ │ │ └── img_m.jpg
│ └── text/
│ ├── gemma3_27b_bf16_1node_4gpu/
│ │ ├── performance_metrics_gemma3_27b_bf16_1node_4gpu.csv
│ │ ├── resource_utilization_gemma3_27b_bf16_1node_4gpu.csv
│ │ ├── test/
│ │ │ ├── text_1.txt
│ │ │ ├── text_2.txt
│ │ │ ├── ...
│ │ │ └── text_n.txt
│ │ └── train/
│ │ ├── text_1.txt
│ │ ├── text_2.txt
│ │ ├── ...
│ │ └── text_m.txt
│ ├── internvl3_14b_hf_bf16_1node_4gpu/
│ ├── internvl3_38b_bf16_1node_4gpu/
│ ├── llava_bf16_1node_2gpu/
│ ├── llava_fp16_1node_1gpu/
│ ├── mistral_small31_24b_bf16_1node_4gpu/
│ ├── phi3_5_vision_bf16_1node_4gpu/
│ ├── qwen2_5vl_32b_bf16_1node_4gpu/
│ └── smolvlm2_2_2b_bf16_1node_4gpu/
├── logs/
│ └── llm_txt_gen/
├── scripts/
│ ├── run_slurm_1node_2gpu.sh
│ ├── run_slurm_1node_4gpu.sh
│ ├── run_slurm_2node_1gpu_each.sh
│ └── run_slurm_job.sh
└── src/
├── __init__.py
├── data_management/
│ ├── __init__.py
│ └── dataset_loader.py
├── pipelines/
│ ├── __init__.py
│ └── run_vllm_experiment.py
├── text_generation_services/
│ ├── __init__.py
│ ├── base_llm_client.py
│ ├── text_generation_manager.py
│ └── vllm_service_client.py
└── utils/
├── __init__.py
├── config_loader.py
└── path_constructor.py
The project's behavior is primarily controlled by several YAML configuration files located in the config/ directory:
- `config/current_vllm_experiment.yaml`:
  - This file defines which experiment plan is currently active.
  - Example:
    ```yaml
    active_experiment_plan_key: "plan_inference_all_models_plantdoc"
    ```
- `config/vllm_experiment_plans.yaml`:
  - Contains definitions for various experiment plans. An experiment plan groups multiple tests (model configurations) to be run sequentially.
  - It also specifies the `dataset_key_for_images` (linking to a dataset-specific config), the number of images to process (`num_images_to_process`, use `-1` for all), and the number of texts to summarize (`num_texts_to_summarize`).
  - Example:
    ```yaml
    plan_inference_all_models_plantdoc:
      description: "Inference of multiple VLMs on 4 GPUs using PlantDoc dataset."
      experiments_to_run:
        - "llava_bf16_1node_4gpu"
        - "gemma3_27b_bf16_1node_4gpu"
        # ... other experiment keys ...
      dataset_key_for_images: "plant_doc"
      num_images_to_process: -1
      num_texts_to_summarize: 0
    ```
- `config/vllm_models_and_tests.yaml`:
  - This is where individual model tests/experiments are defined. Each entry (keyed by a unique name, e.g., `gemma3_27b_bf16_1node_4gpu`) specifies:
    - `model_id`: The Hugging Face model identifier.
    - `task_type`: e.g., `"image-to-text"` or `"text-to-text"`.
    - `prompt_key`: A key referencing a prompt definition in `prompts.yaml`.
    - `use_chat_template` (optional, boolean): If `true`, the system uses the Hugging Face `AutoProcessor.apply_chat_template` method to format the prompt; the `prompt_key` should then point to an entry in `prompts.yaml` containing `user_instruction_text`. If `false` or absent, the `user_prompt_template` from `prompts.yaml` is used directly (after `textwrap.dedent().strip()`). See the prompt-resolution sketch after this configuration overview.
    - `vllm_engine_args`: Arguments passed to the vLLM engine (e.g., `dtype`, `tensor_parallel_size`, `max_model_len`, `trust_remote_code`).
    - `sampling_params`: Parameters for text generation (e.g., `temperature`, `max_tokens`).
    - `slurm_config` (if running via `run_slurm_job.sh`): SLURM directives like `gpus_per_node`, `account`, etc. Note: the main `run_slurm_Xnode_Ygpu.sh` scripts have their own SBATCH directives; the `slurm_config` here is mainly for reference or for adapting the submission logic further.
  - Example for a model using a chat template:
    ```yaml
    gemma3_27b_bf16_1node_4gpu:
      model_id: "google/gemma-3-27b-it"
      task_type: "image-to-text"
      prompt_key: "plant_doc_description_chat_template"  # Points to user_instruction_text
      use_chat_template: true
      vllm_engine_args:
        dtype: "bfloat16"
        trust_remote_code: True
        tensor_parallel_size: 4
        max_model_len: 8192
      sampling_params:
        temperature: 0.6
        max_tokens: 250
    ```
  - Example for a model using a raw prompt template:
    ```yaml
    llava_bf16_1node_4gpu:
      model_id: "llava-hf/llava-1.5-7b-hf"
      task_type: "image-to-text"
      prompt_key: "plant_doc_description_vlm"  # Points to user_prompt_template
      # use_chat_template: false (or omit this flag)
      vllm_engine_args:
        dtype: "bfloat16"
        trust_remote_code: True
        tensor_parallel_size: 4
        max_model_len: 4096
      sampling_params:
        temperature: 0.6
        max_tokens: 250
    ```
- `config/prompts.yaml`:
  - Defines reusable prompt components and full prompt templates.
  - Contains a `common_instructions` section for lengthy, shared instruction blocks, referenced by an anchor (e.g., `&botanical_task_description`).
  - Individual prompts under the `prompts:` key can then either:
    - Reference a common instruction via `user_instruction_key: "key_from_common_instructions"` (when `use_chat_template: true`).
    - Define a `template_format_string` and an `instruction_key` for Python-side formatting (when `use_chat_template: false` and you want to inject common text into a model-specific template).
    - Define a full `user_prompt_template` directly (the legacy way, or for simple prompts).
  - Example structure (see also the resolution sketch after this configuration overview):
    ```yaml
    common_instructions:
      botanical_description: &botanical_task_description |
        You are an AI assistant...

    prompts:
      plant_doc_description_chat_template:  # For use_chat_template: true
        user_instruction_key: "botanical_description"

      plant_doc_description_vlm:  # For use_chat_template: false
        template_format_string: "USER: <image>\n{instruction}\nASSISTANT:"
        instruction_key: "botanical_description"

      plant_doc_description_vlm_phi3_vision_template:  # For use_chat_template: false, specific Phi3 Vision template
        user_prompt_template: |
          <|user|>
          <|image_1|>
          <<: *botanical_task_description
          <|end|>
          <|assistant|>
    ```
- `config/dataset_specific_params/`:
  - Contains YAML files for each dataset, e.g., `plant_doc.yaml`.
  - Defines `dataset_name`, `metadata_filename`, `image_subfolder`, etc.
  - Example (`plant_doc.yaml`):
    ```yaml
    dataset_name: "plant_doc"
    metadata_filename: "metadata.csv"
    image_subfolder: "images"
    text_subfolder: "text_generated_vllm"
    # ... other params
    ```
- `config/paths_config.yaml`:
  - Defines root paths for data and models.
  - Example:
    ```yaml
    data_root: "data"
    models_root: "models"  # for Hugging Face cache in this project
    ```
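To make the interplay between `vllm_models_and_tests.yaml` and `prompts.yaml` concrete, here is a hedged sketch of how a test entry and its prompt definition could be resolved into a final prompt string. Function and variable names are illustrative rather than the project's actual API; the real logic lives in the text generation services and `config_loader.py`.

```python
import textwrap

import yaml
from transformers import AutoProcessor


def build_prompt(test_cfg: dict, prompts_path: str = "config/prompts.yaml") -> str:
    """Illustrative sketch of the prompt-formatting paths described above."""
    with open(prompts_path) as f:
        prompt_yaml = yaml.safe_load(f)
    prompt_cfg = prompt_yaml["prompts"][test_cfg["prompt_key"]]
    common = prompt_yaml.get("common_instructions", {})

    if test_cfg.get("use_chat_template", False):
        # Chat-template path: the model's processor formats the conversation
        # around the shared instruction text.
        instruction = (common.get(prompt_cfg.get("user_instruction_key"))
                       or prompt_cfg.get("user_instruction_text", ""))
        processor = AutoProcessor.from_pretrained(test_cfg["model_id"], trust_remote_code=True)
        messages = [{
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": instruction}],
        }]
        return processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

    if "template_format_string" in prompt_cfg:
        # Raw-template path: inject a shared instruction into a model-specific template.
        instruction = common[prompt_cfg["instruction_key"]]
        return prompt_cfg["template_format_string"].format(instruction=instruction)

    # Legacy path: a full prompt template defined directly.
    return textwrap.dedent(prompt_cfg["user_prompt_template"]).strip()
```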
Inference Statistics:
For each model experiment, the following statistics are saved in its output directory (e.g., `data/plant_doc/text/experiment_name_key/`):

- `performance_metrics_experiment_name_key.csv`: Contains item-level performance data, including latency, token counts, and tokens per second.
- `resource_utilization_experiment_name_key.csv`: Logs CPU, RAM, and GPU utilization at various stages of the generation process.

Generated text descriptions are saved in `train/` or `test/` subdirectories, corresponding to the image splits. File names like `text_1.txt` correspond to `img_1.jpg`/`img_1.png` from the images folder. (A quick pandas example for inspecting the metrics CSVs follows below.)
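For a quick look at these statistics, the CSVs can be loaded with pandas, for example as below. The experiment directory name is taken from the project tree above, while the `tokens_per_second` column name is an assumption used only for illustration.

```python
import pandas as pd

# Inspect item-level performance for one experiment (directory name from the tree above).
metrics = pd.read_csv(
    "data/plant_doc/text/gemma3_27b_bf16_1node_4gpu/"
    "performance_metrics_gemma3_27b_bf16_1node_4gpu.csv"
)
print(metrics.describe())

# 'tokens_per_second' is an assumed column name; guard in case it differs.
if "tokens_per_second" in metrics.columns:
    print("Mean throughput:", metrics["tokens_per_second"].mean())
```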
- Active PLGrid Account: You must have an active PLGrid account.
- Athena Access: Ensure you have access to the Athena supercomputer.
- Computing Grant: A valid computing grant with GPU resources for Athena. Replace `your_gpu_grant_name` in the scripts with your actual grant name.
- Project Files: This project's code.
- UNIX-based Terminal: For SSH and `scp`.
- Hugging Face Account & Token: Required to download models, especially "gated" ones. Generate a token with `read` permissions from Hugging Face Tokens.
- SSH into Athena (replace `your_plgrid_login`):
  ```bash
  ssh your_plgrid_login@athena.cyfronet.pl
  ```
- Check your grants and note your active GPU grant name for Athena:
  ```bash
  hpc-grants
  ```
Use `$SCRATCH` (a high-performance filesystem on PLGrid) for project files.

- Identify your `$SCRATCH` path (on Athena):
  ```bash
  echo $SCRATCH
  ```
  (e.g., `/net/tscratch/people/your_plgrid_login`)
- Create a root directory for your project on `$SCRATCH` (if it doesn't exist). The project is expected to live inside a directory named `llm_txt_gen` on `$SCRATCH`; the `project` folder itself will be inside `llm_txt_gen`:
  ```bash
  # This structure is assumed by slurm_job.conf
  mkdir -p $SCRATCH/llm_txt_gen
  ```
- Upload your project from your computer to Athena:
  - On your local machine, navigate to the directory containing your `project` folder.
  - Use `scp` (replace `your_plgrid_login` and paths):
    ```bash
    # On your computer
    scp -r project your_plgrid_login@athena.cyfronet.pl:/net/tscratch/people/your_plgrid_login/llm_txt_gen/
    ```
- Verify the upload (on Athena):
  ```bash
  cd $SCRATCH/llm_txt_gen/project
  ls -la
  ```
Perform these steps on an Athena worker node with GPU access.
- Request an interactive worker node (from the Athena login node):
  ```bash
  # Adjust --gres=gpu:X, --mem, and --time as needed for setup
  # Replace 'your_gpu_grant_name'
  srun --time=02:00:00 --mem=32G --ntasks=1 --cpus-per-task=8 --gres=gpu:1 --partition=plgrid-gpu-a100 --account=your_gpu_grant_name --pty /bin/bash
  ```
- Load the Miniconda module:
  ```bash
  # The version might change, check with 'module avail Miniconda3'
  module load Miniconda3/23.3.1-0
  eval "$(conda shell.bash hook)"
  ```
  If prompted, run `conda init bash`, then `exit` the `srun` session and start a new one.
- Configure Conda paths on `$SCRATCH` (the `slurm_job.conf` expects the environment to be named `vllm_env_py310`):
  ```bash
  conda config --add envs_dirs ${SCRATCH}/.conda/envs
  conda config --add pkgs_dirs ${SCRATCH}/.conda/pkgs
  ```
- Create the Conda environment:
  ```bash
  conda create -p ${SCRATCH}/.conda/envs/vllm_env_py310 python=3.10 -c conda-forge -y
  ```
- Activate the environment:
  ```bash
  conda activate ${SCRATCH}/.conda/envs/vllm_env_py310
  ```
  The prompt should change to `(vllm_env_py310)`.
(On the worker node, with `vllm_env_py310` active)

- Navigate to your project directory:
  ```bash
  cd $SCRATCH/llm_txt_gen/project
  ```
- Install PyTorch with CUDA, then vLLM. Athena A100s support CUDA 12.x, and vLLM often requires a recent PyTorch version:
  ```bash
  pip install wheel
  # For CUDA 12.1+ (common on A100s)
  pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
  # Install vLLM (latest stable or a specific version compatible with your needs)
  pip install vllm
  ```
- Install the remaining dependencies from `requirements.txt`. The `requirements.txt` should list other packages like `transformers`, `Pillow`, `pandas`, etc., but exclude `torch`, `torchvision`, `torchaudio`, and `vllm`, as they were installed manually:
  ```bash
  pip install -r requirements.txt
  ```
- Verify the installation:
  ```bash
  python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}'); import vllm; print(f'vLLM: {vllm.__version__}')"
  ```
On the worker node, with `vllm_env_py310` active:

- Log in to the Hugging Face CLI:
  ```bash
  pip install huggingface_hub
  huggingface-cli login
  ```
  Paste your Hugging Face token (with `read` access). This will typically save the token to `$HOME/.cache/huggingface/token`.
- Ensure the token is accessible via `$SCRATCH` for Slurm jobs (the Slurm scripts set `HF_HOME` to `$SCRATCH/.cache/huggingface`):
  ```bash
  mkdir -p $SCRATCH/.cache/huggingface
  if [ -f "$HOME/.cache/huggingface/token" ]; then
    cp "$HOME/.cache/huggingface/token" "$SCRATCH/.cache/huggingface/token"
    chmod 600 "$SCRATCH/.cache/huggingface/token"
    echo "HF token copied to $SCRATCH/.cache/huggingface/token"
  elif [ -f "$SCRATCH/.cache/huggingface/token" ]; then
    chmod 600 "$SCRATCH/.cache/huggingface/token"
    echo "HF token already exists in $SCRATCH/.cache/huggingface/token"
  else
    echo "WARNING: HF token not found. Ensure login was successful."
  fi
  ```
  Alternatively, create a `.env` file in your `$SCRATCH/llm_txt_gen/project` directory with your token (see the sketch after this list):
  ```
  HF_TOKEN="hf_YOUR_HUGGINGFACE_TOKEN"
  ```
  The `main.py` script loads this `.env` file.
- Exit the `srun` session:
  ```bash
  exit
  ```
- Dataset Directory Structure:
  - For each new dataset (e.g., `mydataset`), create:
    - `data/mydataset/images/train/` (for training images)
    - `data/mydataset/images/test/` (for test images)
    - `data/mydataset/metadata.csv`
  - The `metadata.csv` file is crucial. It should have at least an `image_path` column. Paths in this column should be relative to the `images` subfolder of your dataset directory (a quick sanity-check sketch follows this list).
    - Example for `data/mydataset/metadata.csv`:
      ```csv
      image_path,split,other_column_if_needed
      train/image_001.jpg,train,some_label
      train/image_002.png,train,another_label
      test/photo_abc.jpg,test,test_label
      ```
    - The `split` column helps organize outputs but can be inferred from the path if it is structured as `train/` or `test/`.
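As a quick sanity check that your `metadata.csv` and image layout line up, something like the following can be used. This is an illustrative sketch, not the project's `dataset_loader.py`.

```python
from pathlib import Path

import pandas as pd

# Check that every image_path listed in metadata.csv exists under images/.
dataset_root = Path("data/mydataset")
metadata = pd.read_csv(dataset_root / "metadata.csv")

missing = [p for p in metadata["image_path"] if not (dataset_root / "images" / p).exists()]
print(f"{len(metadata)} rows in metadata.csv, {len(missing)} missing image files")
```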
- Dataset-Specific Configuration:
  - Create a YAML file in `config/dataset_specific_params/`, e.g., `mydataset.yaml`:
    ```yaml
    dataset_name: "mydataset"  # Must match the folder name in data/
    metadata_filename: "metadata.csv"
    image_subfolder: "images"
    text_subfolder: "text_generated_vllm"  # Default output subfolder for texts
    # Add any other dataset-specific parameters if needed
    ```
- Update `config/vllm_experiment_plans.yaml`:
  - Define a new plan or modify an existing one.
  - Set `dataset_key_for_images` to your new dataset's key (e.g., `"mydataset"`).
  - List the `experiments_to_run` (these keys must be defined in `vllm_models_and_tests.yaml`).
- Update `config/vllm_models_and_tests.yaml`:
  - Define or adjust configurations for each model experiment.
  - Ensure `prompt_key` points to a valid entry in `prompts.yaml`.
  - If using `use_chat_template: true`, make sure the corresponding prompt in `prompts.yaml` has `user_instruction_text`.
  - Adjust `vllm_engine_args` (like `tensor_parallel_size`, `max_model_len`) and `sampling_params` as needed for each model and your hardware.
- Update `config/prompts.yaml`:
  - Add or modify prompts. Use the `common_instructions` section for reusable text blocks.
  - For models using `use_chat_template: true`, provide the core instruction under `user_instruction_key` (referencing an entry in `common_instructions`) or `user_instruction_text` directly.
  - For models using raw templates (`use_chat_template: false` or absent), provide `template_format_string` and `instruction_key`, or a full `user_prompt_template`.
- Select Active Plan:
  - Edit `config/current_vllm_experiment.yaml` to set `active_experiment_plan_key` to the plan you want to run (a small consistency-check sketch follows this list).
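A small, illustrative consistency check can confirm that the active plan key and its experiment keys are defined across the config files. It assumes the top-level YAML structure shown in the examples above.

```python
import yaml


def load_yaml(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)


current = load_yaml("config/current_vllm_experiment.yaml")
plans = load_yaml("config/vllm_experiment_plans.yaml")
tests = load_yaml("config/vllm_models_and_tests.yaml")

plan_key = current["active_experiment_plan_key"]
plan = plans[plan_key]
print(f"Active plan: {plan_key}, dataset: {plan['dataset_key_for_images']}")

missing = [key for key in plan["experiments_to_run"] if key not in tests]
if missing:
    print("Experiment keys missing from vllm_models_and_tests.yaml:", missing)
else:
    print("All experiment keys are defined.")
```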
- Navigate to the `scripts/` directory on Athena:
  ```bash
  cd $SCRATCH/llm_txt_gen/project/scripts
  ```
- Choose and edit the appropriate Slurm script (e.g., `run_slurm_1node_4gpu.sh`):
  - The scripts (`run_slurm_1node_2gpu.sh`, `run_slurm_1node_4gpu.sh`) contain SBATCH directives for resource allocation.
  - Crucially, update the `#SBATCH --account=` directive to your valid PLGrid GPU grant name.
  - You can adjust `--time`, `--mem`, and `--cpus-per-task` as needed for your set of experiments. The `tensor_parallel_size` for your models should generally match the `--gres=gpu:X` value.
  - Example snippet from `run_slurm_1node_4gpu.sh`:
    ```bash
    #!/bin/bash
    #SBATCH --job-name=llm_txt_gen
    #SBATCH --chdir=/net/tscratch/people/{your plgrid login}/llm_txt_gen/project
    #SBATCH --output=../logs/llm_txt_gen/vllm_job_%A_task_%a.out
    #SBATCH --error=../logs/llm_txt_gen/vllm_job_%A_task_%a.err
    #SBATCH --partition=plgrid-gpu-a100
    #SBATCH --account=your_gpu_grant_name
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:4          # 4 GPUs
    #SBATCH --cpus-per-task=32    # 8 CPUs per GPU
    #SBATCH --mem=192G            # 48GB RAM per GPU
    #SBATCH --time=02:00:00       # Run time

    # ... (rest of the script as provided, ensuring paths in slurm_job.conf are correct)
    ```
  - The `slurm_job.conf` file (in the `project` root) sets variables like `PROJECT_ROOT_ON_SCRATCH`, `CONDA_ENV_NAME`, etc. Ensure these are correct.
- Submit the job:
  ```bash
  sbatch run_slurm_1node_4gpu.sh
  ```
- Monitor the queue:
  ```bash
  squeue -u your_plgrid_login
  ```
- Check the output/error files: these are written to `logs/llm_txt_gen/` relative to your project root (e.g., `$SCRATCH/llm_txt_gen/project/logs/llm_txt_gen/`).
- Check the results directories: navigate to `$SCRATCH/llm_txt_gen/project/data/your_dataset_name/text/`. You should see a subdirectory for each experiment key defined in your plan. Inside each, look for the following (a quick counting sketch follows below):
  - `train/` and/or `test/` subfolders containing `.txt` files with generated descriptions.
  - `performance_metrics_experiment_key.csv`
  - `resource_utilization_experiment_key.csv`
