Sudoku MGDM Reproduction Issue: Only ~38–40% ACC with Default Settings (Paper Reports 100%) #7

@quasar529

Description

Summary:
I tried to reproduce the Sudoku MGDM (6M) result reported in the paper (100% ACC) using the default settings in this repo.
With model_config_tiny, diffusion_steps=20, and the default hyperparameters from Appendix C.2 (MGDM Implementation Details), my run on 8× L40S GPUs only reaches ~0.38–0.40 ACC on sudoku_test.


Code Used:

# Match default training settings in scripts/sudoku/train-mdm.sh
export WANDB_DISABLED=false

EXP=output/sudoku/mdm-alpha0.25-gamma1-bs1024-lr1e-3-ep300-T20-$(date +%Y%m%d-%H%M%S)
mkdir -p "$EXP"

MASTER_PORT=${MASTER_PORT:-20099}
DATASET_TRAIN=sudoku_train

accelerate launch --multi_gpu --num_machines 1 --mixed_precision fp16 --num_processes 8 --main_process_port ${MASTER_PORT} \
  src/train_bash.py \
  --stage mgdm --overwrite_output_dir \
  --cache_dir ./cache \
  --model_name_or_path model_config_tiny \
  --do_train \
  --dataset ${DATASET_TRAIN} \
  --finetuning_type full \
  --cutoff_len 164 \
  --output_dir "$EXP" \
  --overwrite_cache \
  --per_device_train_batch_size 128 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type cosine \
  --logging_steps 1 \
  --val_size 448 \
  --per_device_eval_batch_size 32 \
  --evaluation_strategy steps \
  --eval_steps 100 \
  --save_steps 500 \
  --learning_rate 1e-3 \
  --num_train_epochs 300.0 \
  --plot_loss \
  --run_name ${DATASET_TRAIN}_prefix \
  --preprocessing_num_workers 8 \
  --fp16 \
  --save_total_limit 1 \
  --remove_unused_columns False \
  --diffusion_steps 20 \
  --save_safetensors False \
  --token_reweighting True \
  --time_reweighting linear \
  --topk_decoding True \
  --alpha 0.25 \
  --gamma 1 \
  2>&1 | tee "$EXP/train.log"

# Evaluation
for dataset in sudoku_test; do
  topk_decoding=True
  mkdir -p "$EXP/$dataset"
  CUDA_VISIBLE_DEVICES=0 \
  python3 -u src/train_bash.py \
    --stage mgdm --overwrite_output_dir \
    --cache_dir ./cache \
    --model_name_or_path model_config_tiny \
    --do_predict \
    --cutoff_len 164 \
    --dataset $dataset \
    --finetuning_type full \
    --diffusion_steps 20 \
    --output_dir "$EXP/${dataset}" \
    --checkpoint_dir "$EXP" \
    --remove_unused_columns False \
    --decoding_strategy stochastic0.5-linear \
    --topk_decoding $topk_decoding \
    > "$EXP/${dataset}/eval-TopK${topk_decoding}.log"
done

Result:

***** eval metrics *****
  epoch                   =      300.0
  eval_acc                =     0.3839
  eval_loss               =     0.0506
  eval_runtime            = 0:00:01.63
  eval_samples_per_second =    273.304
  eval_steps_per_second   =       1.22

Differences I’m unsure about (could explain the gap)

| Topic | Paper / README | What I used |
|---|---|---|
| Epochs / Batch / LR (tiny) | Sudoku: 300 ep, batch 1024, LR 1e-3 (tiny) | 300 / 1024 / 1e-3 |
| T (diffusion steps) | T=20 if avg. output tokens > 20 (Sudoku has 81) | 20 |
| Decoding temperature τ | τ=0.5 for all tasks | `--decoding_strategy stochastic0.5-linear`. (Repo code seems to do argmax for x₀ and uses stochasticity only to choose which positions to update; there isn't an explicit τ sampling switch. Please confirm.) |
| Reweighting | Table 3 ablates sequence/token reweighting; the main text shows 6M MGDM hitting 100% on Sudoku but doesn't spell out the exact reweighting combo for that point | `--time_reweighting linear --token_reweighting True --alpha 0.25 --gamma 1` |
| Cutoff / lengths | Sudoku `cutoff_len=164` (README) | 164 |
| Data split | First 100k train, next 1k test (paper) | Repo's `sudoku_train` / `sudoku_test` |
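To make the τ question concrete, here is a minimal sketch of the two decoding behaviors I mean. The names (`sample_x0`, `logits`, `tau`, `use_argmax`) are mine for illustration, not identifiers from this repo:

```python
import torch

def sample_x0(logits: torch.Tensor, tau: float = 0.5, use_argmax: bool = True) -> torch.Tensor:
    """Pick x0 tokens from per-position logits of shape (seq_len, vocab_size).

    use_argmax=True mirrors what the repo code appears to do (deterministic x0,
    with stochasticity only in *which* masked positions get committed each step);
    use_argmax=False is explicit temperature sampling, where tau=0.5 sharpens
    the categorical distribution before drawing each token independently.
    """
    if use_argmax:
        return logits.argmax(dim=-1)
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```

If the paper's τ=0.5 means the second branch rather than the first, that alone could change the eval behavior, which is why I'm asking how τ is realized.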

Questions:

  1. Could you share the exact recipe (flags) used for the "MGDM (6M) = 100% ACC on Sudoku" point, especially the time_reweighting choice, token reweighting (α, γ/β), and decoding strategy (is τ=0.5 realized via categorical sampling for x₀ in your code, or via a different flag)?
  2. Is there any difference in the Sudoku split/preprocessing between the paper and this repo’s sudoku_train / sudoku_test (e.g., deterministic ordering vs shuffled)? I used the repo data; please confirm that matches the paper split.
  3. In scripts/sudoku/train-mdm.sh, what is the default --decoding_strategy you used for Sudoku? My eval used stochastic0.5-linear; if the intended default is deterministic-<linear|cosine>, I can rerun to check.
  4. Any expected seed sensitivity? If yes, how many seeds should reach 100%? Could you share your seed(s)?
  5. Is there any recommended environment pinning (PyTorch/Transformers/Accelerate versions) beyond requirements.txt that affects Sudoku?
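For context on question 1: my reading of `--token_reweighting` with (α, γ) is a focal-loss-style per-token weight. Below is a minimal sketch of that assumption; the exact functional form in the repo may well differ, which is precisely what I'd like to confirm:

```python
import torch
import torch.nn.functional as F

def token_reweighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                          alpha: float = 0.25, gamma: float = 1.0) -> torch.Tensor:
    """Focal-style token reweighting (my assumption, not necessarily the repo's).

    logits: (seq_len, vocab_size), targets: (seq_len,).
    Each token's cross-entropy is scaled by alpha * (1 - p_target)**gamma,
    so low-confidence (hard) tokens dominate the loss.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-token CE
    p_target = torch.exp(-ce)                                # model prob of the gold token
    weight = alpha * (1.0 - p_target) ** gamma
    return (weight * ce).mean()
```

With γ=0 this collapses to α-scaled plain cross-entropy, so if the repo's reweighting is something else entirely (e.g. rank- or time-based), my α=0.25, γ=1 run may not match the paper's setting at all.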

I can provide full logs (train.log, eval.log) and W&B run links if needed.
Would appreciate guidance on what might be causing the large accuracy gap.
