Since I only have a single RTX 3090 GPU, I lowered the learning rate for Stage 1 (the original learning rate caused NaN losses):
```python
optimizer = dict(
    type="AdamW",
    lr=5e-5,
    weight_decay=0.001,
    paramwise_cfg=dict(
        custom_keys={
            "img_backbone": dict(lr_mult=0.8),
        }
    ),
)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy="CosineAnnealing",
    warmup="linear",
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    min_lr_ratio=1e-4,
)
```
Configuration for Stage 2:
```python
optimizer = dict(
    type="AdamW",
    # lr=3e-4,
    lr=3e-4 / 2,
    weight_decay=0.001,
    paramwise_cfg=dict(
        custom_keys={
            "img_backbone": dict(lr_mult=0.1),
        }
    ),
)
optimizer_config = dict(grad_clip=dict(max_norm=25, norm_type=2))
lr_config = dict(
    policy="CosineAnnealing",
    warmup="linear",
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    min_lr_ratio=1e-3,
)
```
The results I reproduced differ significantly from the reported ones.
