Summary:
I tried to reproduce the Sudoku MGDM (6M) result reported in the paper (100% ACC) using the default settings in this repo.
With model_config_tiny, diffusion_steps=20, and the default hyperparameters given in Appendix C.2 (MGDM IMPLEMENTATION DETAILS), my run on 8× L40S GPUs only reaches ~0.38–0.40 ACC on sudoku_test.
Code Used:
```bash
# Match default training settings in scripts/sudoku/train-mdm.sh
export WANDB_DISABLED=false
EXP=output/sudoku/mdm-alpha0.25-gamma1-bs1024-lr1e-3-ep300-T20-$(date +%Y%m%d-%H%M%S)
mkdir -p "$EXP"
MASTER_PORT=${MASTER_PORT:-20099}
DATASET_TRAIN=sudoku_train
accelerate launch --multi_gpu --num_machines 1 --mixed_precision fp16 --num_processes 8 --main_process_port ${MASTER_PORT} \
    src/train_bash.py \
    --stage mgdm --overwrite_output_dir \
    --cache_dir ./cache \
    --model_name_or_path model_config_tiny \
    --do_train \
    --dataset ${DATASET_TRAIN} \
    --finetuning_type full \
    --cutoff_len 164 \
    --output_dir "$EXP" \
    --overwrite_cache \
    --per_device_train_batch_size 128 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --val_size 448 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 500 \
    --learning_rate 1e-3 \
    --num_train_epochs 300.0 \
    --plot_loss \
    --run_name ${DATASET_TRAIN}_prefix \
    --preprocessing_num_workers 8 \
    --fp16 \
    --save_total_limit 1 \
    --remove_unused_columns False \
    --diffusion_steps 20 \
    --save_safetensors False \
    --token_reweighting True \
    --time_reweighting linear \
    --topk_decoding True \
    --alpha 0.25 \
    --gamma 1 \
    2>&1 | tee "$EXP/train.log"
```
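For reference, here is how I currently read the `--time_reweighting linear` and `--token_reweighting`/`--alpha`/`--gamma` flags: a focal-loss-style per-token weight combined with a per-diffusion-step weight. This is a minimal pure-Python sketch of my assumption, not the repo's implementation; the function names are mine, and the direction of the linear time weight (upweighting late vs. early steps) is one of the things I'd like confirmed.

```python
import math

def time_weight(t, T, mode="linear"):
    """Per-diffusion-step weight. Assuming 'linear' means weight
    proportional to t / T (upweighting later, less-masked steps);
    the opposite direction (T - t) / T is equally plausible."""
    return t / T if mode == "linear" else 1.0

def token_weight(p_true, alpha=0.25, gamma=1.0):
    """Focal-loss-style token reweighting (my guess from the alpha/gamma
    naming): hard tokens, i.e. low probability on the true symbol,
    receive a larger weight."""
    return alpha * (1.0 - p_true) ** gamma

def reweighted_masked_loss(p_true_per_token, t, T, alpha=0.25, gamma=1.0):
    """Mean cross-entropy over the masked positions, scaled by both weights."""
    ce = [token_weight(p, alpha, gamma) * -math.log(p) for p in p_true_per_token]
    return time_weight(t, T) * sum(ce) / len(ce)
```

Under this reading, a cell the model already gets right (p ≈ 0.95) contributes far less than an uncertain one (p ≈ 0.5), which would matter a lot on Sudoku, where most cells become easy once a few hard ones are fixed.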
```bash
# Evaluation
for dataset in sudoku_test; do
    topk_decoding=True
    mkdir -p "$EXP/$dataset"
    CUDA_VISIBLE_DEVICES=0 \
    python3 -u src/train_bash.py \
        --stage mgdm --overwrite_output_dir \
        --cache_dir ./cache \
        --model_name_or_path model_config_tiny \
        --do_predict \
        --cutoff_len 164 \
        --dataset $dataset \
        --finetuning_type full \
        --diffusion_steps 20 \
        --output_dir "$EXP/${dataset}" \
        --checkpoint_dir "$EXP" \
        --remove_unused_columns False \
        --decoding_strategy stochastic0.5-linear \
        --topk_decoding $topk_decoding \
        > "$EXP/${dataset}/eval-TopK${topk_decoding}.log"
done
```

Result:
```
***** eval metrics *****
  epoch                   =      300.0
  eval_acc                =     0.3839
  eval_loss               =     0.0506
  eval_runtime            = 0:00:01.63
  eval_samples_per_second =    273.304
  eval_steps_per_second   =       1.22
```
Differences I'm unsure about (any of which could explain the gap):
| Topic | Paper / README | What I used |
|---|---|---|
| Epochs / batch / LR (tiny) | Sudoku: 300 epochs, batch 1024, LR 1e-3 (tiny) | 300 / 1024 / 1e-3 |
| T (diffusion steps) | T=20 if avg. output length > 20 tokens (Sudoku has 81) | 20 |
| Decoding temperature τ | τ=0.5 for all tasks | `--decoding_strategy stochastic0.5-linear`. (The repo code seems to take argmax for x₀ and use stochasticity only to pick which positions to update; I don't see an explicit τ sampling switch. Please confirm.) |
| Reweighting | Table 3 ablates sequence/token reweighting; the main text shows 6M MGDM hitting 100% on Sudoku but doesn't spell out the exact reweighting combo for that point | `--time_reweighting linear --token_reweighting True --alpha 0.25 --gamma 1` |
| Cutoff / lengths | Sudoku cutoff_len=164 (README) | 164 |
| Data split | First 100k train, next 1k test (paper) | Repo's sudoku_train / sudoku_test |
Questions:
- Could you share the exact recipe (flags) used for the "MGDM (6M) = 100% ACC on Sudoku" point, especially the time_reweighting choice, the token reweighting (α, γ/β), and the decoding strategy? (Is τ=0.5 realized via categorical sampling for x₀ in your code, or via a different flag?)
- Is there any difference in the Sudoku split/preprocessing between the paper and this repo's `sudoku_train`/`sudoku_test` (e.g., deterministic ordering vs. shuffled)? I used the repo data; please confirm it matches the paper split.
- In `scripts/sudoku/train-mdm.sh`, what is the default `--decoding_strategy` you used for Sudoku? My eval used `stochastic0.5-linear`; if the intended default is `deterministic-<linear|cosine>`, I can rerun to check.
- Is any seed sensitivity expected? If yes, how many seeds should reach 100%? Could you share your seed(s)?
- Is there any recommended environment pinning (PyTorch/Transformers/Accelerate versions) beyond `requirements.txt` that affects Sudoku?
I can provide full logs (train.log, eval.log) and W&B run links if needed. I'd appreciate any guidance on what might be causing the large accuracy gap.