
Conversation

@NeoLegends
Member

I think in a pure LM setup you don't use the scale.

@NeoLegends self-assigned this Dec 22, 2025
@albertz
Member

albertz commented Dec 22, 2025

> I think in a pure LM setup you don't use the scale.

I don't think this is true in general. I have seen both variants, also for LMs. For example, Gemma3:

self.embed_tokens = Gemma3TextScaledWordEmbedding(
    config.vocab_size, config.hidden_size, self.padding_idx, embed_scale=self.config.hidden_size**0.5
)
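
For context, such a scaled embedding is essentially an nn.Embedding that multiplies the lookup result by a constant. A minimal sketch (illustrative only, not the actual Gemma3TextScaledWordEmbedding implementation from HF Transformers):

import torch
import torch.nn as nn

class ScaledWordEmbedding(nn.Embedding):
    """Token embedding that multiplies the lookup by a fixed scale, e.g. hidden_size**0.5."""

    def __init__(self, num_embeddings, embedding_dim, padding_idx=None, embed_scale=1.0):
        super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)
        # The scale is a fixed constant derived from the model dim, not a trained parameter.
        self.embed_scale = embed_scale

    def forward(self, input_ids):
        return super().forward(input_ids) * self.embed_scale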

Note that there are some other things to consider:

If you don't apply the scale in forward, what people do instead is to apply the scale during init, or to make the random init relatively large. E.g. nanochat:

elif isinstance(module, nn.Embedding):
    torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)
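
To make the "scale in forward vs. scale at init" point concrete, a small sketch (dimensions made up): both variants give embedding outputs with std around 1 at init; where they differ is in how gradients and weight decay act on the raw weights.

import torch
import torch.nn as nn

d_model, vocab = 512, 1000
ids = torch.randint(0, vocab, (8, 16))

# Variant A: small init (std = 1/sqrt(d_model)), scale sqrt(d_model) applied in forward.
emb_a = nn.Embedding(vocab, d_model)
nn.init.normal_(emb_a.weight, mean=0.0, std=d_model ** -0.5)
out_a = emb_a(ids) * d_model ** 0.5

# Variant B: large init (std = 1.0, as in the nanochat snippet), no scale in forward.
emb_b = nn.Embedding(vocab, d_model)
nn.init.normal_(emb_b.weight, mean=0.0, std=1.0)
out_b = emb_b(ids)

print(out_a.std().item(), out_b.std().item())  # both roughly 1.0 at init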

I also saw that some people use custom (much larger) LRs for the embeddings, which again might compensate for not using a scale. E.g. see nanochat.
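
Such a custom LR would typically be set via optimizer param groups, roughly like this (a sketch with made-up numbers, not nanochat's actual setup):

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for an LM, just to show per-group LRs."""

    def __init__(self, vocab=1000, d_model=512):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)

model = TinyLM()

# Put the embedding into its own param group with a larger LR
# (the 10x factor and the base LR are made up, only to show the mechanism).
embed_params = [p for n, p in model.named_parameters() if "embed_tokens" in n]
other_params = [p for n, p in model.named_parameters() if "embed_tokens" not in n]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": 3e-4},
        {"params": embed_params, "lr": 3e-3},
    ]
)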

If you share the embedding weights with the LM head, this might affect whether you want such a scale or not (I'm not sure in what way, though...). Most LMs do this.
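
For reference, the usual weight tying is just sharing the parameter tensor (minimal sketch). Note that with tying, a forward-side embed_scale only touches the input path, not the logits, which may be part of why the interaction is not obvious:

import torch.nn as nn

vocab, d_model = 1000, 512
embed_tokens = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)

# Tie the weights: embedding and output projection share one parameter tensor.
lm_head.weight = embed_tokens.weight

# With a forward-side embed_scale, only the input path is scaled:
#   hidden = embed_tokens(ids) * d_model ** 0.5  # scaled on the way in
#   logits = lm_head(hidden)                     # same tied weight, no extra scale on the way out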
