Sudoku MGDM Reproduction Issue: Only ~38–40% ACC with Default Settings (Paper Reports 100%) #7

@quasar529

Description

Summary:
I tried to reproduce the Sudoku MGDM (6M) result reported in the paper (100% ACC) using the default settings in this repo.
With model_config_tiny, diffusion_steps=20, and the default hyperparameters from Appendix C.2 (MGDM Implementation Details), my run on 8× L40S GPUs only reaches ~0.38–0.40 ACC on sudoku_test.


Code Used:

# Match default training settings in scripts/sudoku/train-mdm.sh
export WANDB_DISABLED=false

EXP=output/sudoku/mdm-alpha0.25-gamma1-bs1024-lr1e-3-ep300-T20-$(date +%Y%m%d-%H%M%S)
mkdir -p "$EXP"

MASTER_PORT=${MASTER_PORT:-20099}
DATASET_TRAIN=sudoku_train

accelerate launch --multi_gpu --num_machines 1 --mixed_precision fp16 --num_processes 8 --main_process_port ${MASTER_PORT} \
  src/train_bash.py \
  --stage mgdm --overwrite_output_dir \
  --cache_dir ./cache \
  --model_name_or_path model_config_tiny \
  --do_train \
  --dataset ${DATASET_TRAIN} \
  --finetuning_type full \
  --cutoff_len 164 \
  --output_dir "$EXP" \
  --overwrite_cache \
  --per_device_train_batch_size 128 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type cosine \
  --logging_steps 1 \
  --val_size 448 \
  --per_device_eval_batch_size 32 \
  --evaluation_strategy steps \
  --eval_steps 100 \
  --save_steps 500 \
  --learning_rate 1e-3 \
  --num_train_epochs 300.0 \
  --plot_loss \
  --run_name ${DATASET_TRAIN}_prefix \
  --preprocessing_num_workers 8 \
  --fp16 \
  --save_total_limit 1 \
  --remove_unused_columns False \
  --diffusion_steps 20 \
  --save_safetensors False \
  --token_reweighting True \
  --time_reweighting linear \
  --topk_decoding True \
  --alpha 0.25 \
  --gamma 1 \
  2>&1 | tee "$EXP/train.log"

# Evaluation
for dataset in sudoku_test; do
  topk_decoding=True
  mkdir -p "$EXP/$dataset"
  CUDA_VISIBLE_DEVICES=0 \
  python3 -u src/train_bash.py \
    --stage mgdm --overwrite_output_dir \
    --cache_dir ./cache \
    --model_name_or_path model_config_tiny \
    --do_predict \
    --cutoff_len 164 \
    --dataset $dataset \
    --finetuning_type full \
    --diffusion_steps 20 \
    --output_dir "$EXP/${dataset}" \
    --checkpoint_dir "$EXP" \
    --remove_unused_columns False \
    --decoding_strategy stochastic0.5-linear \
    --topk_decoding $topk_decoding \
    > "$EXP/${dataset}/eval-TopK${topk_decoding}.log"
done

Result:

***** eval metrics *****
  epoch                   =      300.0
  eval_acc                =     0.3839
  eval_loss               =     0.0506
  eval_runtime            = 0:00:01.63
  eval_samples_per_second =    273.304
  eval_steps_per_second   =       1.22

Differences I’m unsure about (could explain the gap)

| Topic | Paper / README | What I used |
|---|---|---|
| Epochs / Batch / LR (tiny) | Sudoku: 300 ep, batch 1024, LR 1e-3 (tiny) | 300 / 1024 / 1e-3 |
| T (diffusion steps) | T=20 if avg. output tokens > 20 (Sudoku has 81) | 20 |
| Decoding temperature τ | τ=0.5 for all tasks | `--decoding_strategy stochastic0.5-linear`. (Repo code seems to do argmax for x₀ and uses stochasticity only to choose which positions to update; there isn't an explicit τ sampling switch. Please confirm.) |
| Reweighting | Table 3 ablates sequence/token reweighting; the main text shows 6M MGDM hitting 100% on Sudoku but doesn't spell out the exact reweighting combo for that point | `--time_reweighting linear --token_reweighting True --alpha 0.25 --gamma 1` |
| Cutoff / lengths | Sudoku `cutoff_len=164` (README) | 164 |
| Data split | First 100k train, next 1k test (paper) | Repo's `sudoku_train` / `sudoku_test` |
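To make the τ question concrete, here is a minimal sketch of the two decoding behaviors I mean. The names (`sample_x0`, `logits`, `tau`, `use_argmax`) are mine for illustration, not identifiers from this repo:

```python
import torch

def sample_x0(logits: torch.Tensor, tau: float = 0.5, use_argmax: bool = True) -> torch.Tensor:
    """Pick x0 tokens from per-position logits of shape (seq_len, vocab_size).

    use_argmax=True mirrors what the repo code appears to do (deterministic x0,
    with stochasticity only in *which* masked positions get committed each step);
    use_argmax=False is explicit temperature sampling, where tau=0.5 sharpens
    the categorical distribution before drawing each token independently.
    """
    if use_argmax:
        return logits.argmax(dim=-1)
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```

If the paper's τ=0.5 means the second branch rather than the first, that alone could change the eval behavior, which is why I'm asking how τ is realized.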

Questions:

  1. Could you share the exact recipe (flags) used for the "MGDM (6M) = 100% ACC on Sudoku" point, especially the time_reweighting choice, token reweighting (α, γ/β), and decoding strategy (is τ=0.5 realized via categorical sampling for x₀ in your code, or via a different flag)?
  2. Is there any difference in the Sudoku split/preprocessing between the paper and this repo’s sudoku_train / sudoku_test (e.g., deterministic ordering vs shuffled)? I used the repo data; please confirm that matches the paper split.
  3. In scripts/sudoku/train-mdm.sh, what is the default --decoding_strategy you used for Sudoku? My eval used stochastic0.5-linear; if the intended default is deterministic-<linear|cosine>, I can rerun to check.
  4. Any expected seed sensitivity? If yes, how many seeds should reach 100%? Could you share your seed(s)?
  5. Is there any recommended environment pinning (PyTorch/Transformers/Accelerate versions) beyond requirements.txt that affects Sudoku?
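For context on question 1: my reading of `--token_reweighting` with (α, γ) is a focal-loss-style per-token weight. Below is a minimal sketch of that assumption; the exact functional form in the repo may well differ, which is precisely what I'd like to confirm:

```python
import torch
import torch.nn.functional as F

def token_reweighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                          alpha: float = 0.25, gamma: float = 1.0) -> torch.Tensor:
    """Focal-style token reweighting (my assumption, not necessarily the repo's).

    logits: (seq_len, vocab_size), targets: (seq_len,).
    Each token's cross-entropy is scaled by alpha * (1 - p_target)**gamma,
    so low-confidence (hard) tokens dominate the loss.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-token CE
    p_target = torch.exp(-ce)                                # model prob of the gold token
    weight = alpha * (1.0 - p_target) ** gamma
    return (weight * ce).mean()
```

With γ=0 this collapses to α-scaled plain cross-entropy, so if the repo's reweighting is something else entirely (e.g. rank- or time-based), my α=0.25, γ=1 run may not match the paper's setting at all.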

I can provide full logs (train.log, eval.log) and W&B run links if needed.
Would appreciate guidance on what might be causing the large accuracy gap.
