
Conversation

@NeoLegends
Member

I think in a pure LM setup you don't use the scale.

@NeoLegends self-assigned this Dec 22, 2025
@albertz
Member

albertz commented Dec 22, 2025

> I think in a pure LM setup you don't use the scale.

I don't think this is true in general. I have seen both variants, also for LMs. For example, Gemma3:

self.embed_tokens = Gemma3TextScaledWordEmbedding(
    config.vocab_size, config.hidden_size, self.padding_idx, embed_scale=self.config.hidden_size**0.5
)
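
For context, such a scaled embedding is essentially an nn.Embedding that multiplies the lookup result by a constant. A minimal sketch (illustrative only, not the actual Gemma3TextScaledWordEmbedding implementation from HF Transformers):

import torch
import torch.nn as nn

class ScaledWordEmbedding(nn.Embedding):
    """Token embedding that multiplies the lookup by a fixed scale, e.g. hidden_size**0.5."""

    def __init__(self, num_embeddings, embedding_dim, padding_idx=None, embed_scale=1.0):
        super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)
        # The scale is a fixed constant derived from the model dim, not a trained parameter.
        self.embed_scale = embed_scale

    def forward(self, input_ids):
        return super().forward(input_ids) * self.embed_scale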

Note that there are some other things to consider:

If you don't apply the scale in forward, what people do instead is to apply the scale during init, or to make the random init relatively large. E.g. nanochat:

elif isinstance(module, nn.Embedding):
    torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)
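
To make the "scale in forward vs. scale at init" point concrete, a small sketch (dimensions made up): both variants give embedding outputs with std around 1 at init; where they differ is in how gradients and weight decay act on the raw weights.

import torch
import torch.nn as nn

d_model, vocab = 512, 1000
ids = torch.randint(0, vocab, (8, 16))

# Variant A: small init (std = 1/sqrt(d_model)), scale sqrt(d_model) applied in forward.
emb_a = nn.Embedding(vocab, d_model)
nn.init.normal_(emb_a.weight, mean=0.0, std=d_model ** -0.5)
out_a = emb_a(ids) * d_model ** 0.5

# Variant B: large init (std = 1.0, as in the nanochat snippet), no scale in forward.
emb_b = nn.Embedding(vocab, d_model)
nn.init.normal_(emb_b.weight, mean=0.0, std=1.0)
out_b = emb_b(ids)

print(out_a.std().item(), out_b.std().item())  # both roughly 1.0 at init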

I also saw that some people use custom (much larger) LRs for the embeddings, which again might compensate for not using a scale. E.g. see nanochat.
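
Such a custom LR would typically be set via optimizer param groups, roughly like this (a sketch with made-up numbers, not nanochat's actual setup):

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for an LM, just to show per-group LRs."""

    def __init__(self, vocab=1000, d_model=512):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)

model = TinyLM()

# Put the embedding into its own param group with a larger LR
# (the 10x factor and the base LR are made up, only to show the mechanism).
embed_params = [p for n, p in model.named_parameters() if "embed_tokens" in n]
other_params = [p for n, p in model.named_parameters() if "embed_tokens" not in n]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": 3e-4},
        {"params": embed_params, "lr": 3e-3},
    ]
)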

If you share the embedding weights with the LM head, this might affect whether you want such a scale or not (I'm not sure in what way, though...). Most LMs do this.
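
For reference, the usual weight tying is just sharing the parameter tensor (minimal sketch). Note that with tying, a forward-side embed_scale only touches the input path, not the logits, which may be part of why the interaction is not obvious:

import torch.nn as nn

vocab, d_model = 1000, 512
embed_tokens = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)

# Tie the weights: embedding and output projection share one parameter tensor.
lm_head.weight = embed_tokens.weight

# With a forward-side embed_scale, only the input path is scaled:
#   hidden = embed_tokens(ids) * d_model ** 0.5  # scaled on the way in
#   logits = lm_head(hidden)                     # same tied weight, no extra scale on the way out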
