support for encoder from gemma-3-4b-it #42
Conversation
llava/model/llava_arch.py
Outdated
# ====================================================================================
# ================== FIX 2: Brute-force clip the features ============================
image_features = torch.clamp(image_features, min=-10.0, max=10.0)
we need to do this only for the SigLIP tower from Gemma 3, to make sure other encoders are unchanged?
yes, i will add a check to do it only for vision towers from Gemma
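A minimal sketch of what that gated clamp could look like (the `vision_tower_name` parameter, the helper name, and the `[-10, 10]` bounds taken from the diff above are illustrative assumptions, not the final implementation):

```python
import torch


def clamp_if_gemma(image_features: torch.Tensor, vision_tower_name: str) -> torch.Tensor:
    """Clip feature magnitudes only for vision towers from Gemma.

    Features from other encoders pass through unchanged, as requested
    in the review. Name and bounds are illustrative.
    """
    if "gemma" in vision_tower_name.lower():
        # Brute-force clip the extreme magnitudes to a stable range
        # before the features reach the language model.
        return torch.clamp(image_features, min=-10.0, max=10.0)
    return image_features
```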
class LLaVATrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
for reference, this is how compute_loss looks in the HF Trainer class: https://github.com/huggingface/transformers/blob/052e652d6d53c2b26ffde87e039b723949a53493/src/transformers/trainer.py#L3618
class LLaVATrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
can we also make sure this compute_loss is bypassed for other encoders? I am still not sure why exactly we need a custom loss here. Is it to handle a sequence-mismatch problem? If so, why is it not a problem for other encoders? Or is there another reason? Unless those are clear, let's make sure we do this custom compute_loss only for gemma3-siglip
in your `if 'siglip' in self.data_args.image_processor.image_processor_type.lower():` line, you need to update it to also cover Gemma, e.g.
`proc_type = self.data_args.image_processor.image_processor_type.lower()`
`if 'siglip' in proc_type or 'gemma' in proc_type:`
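A small self-contained helper for that check might look like this (the function name is an assumption; note that in Python, `'siglip' or 'gemma' in s` parses as `'siglip' or ('gemma' in s)`, which is always truthy, so each substring needs its own `in` test):

```python
def is_gemma_siglip(image_processor_type: str) -> bool:
    """True when the image processor comes from a SigLIP/Gemma vision tower.

    Each substring gets its own `in` test: writing
    `'siglip' or 'gemma' in proc` would always evaluate truthy,
    because the bare string 'siglip' is itself truthy.
    """
    proc = image_processor_type.lower()
    return "siglip" in proc or "gemma" in proc
```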
This PR adds support for using the gemma3-siglip-encoder (from `google/gemma-3-4b-it`) as a vision tower for LLaVA pre-training with a Vicuna-based LLM.

1. Numerical Instability Issue - NaN Loss

Initial attempts to pre-train using the gemma3-siglip-encoder resulted in a persistent `NaN` loss. Debugging revealed that the encoder produces feature outputs with an extremely large numerical magnitude. This triggered a low-level bug deep inside the language model's `CrossEntropyLoss` function, causing it to fail even when all inputs (`logits` and `labels`) were valid.

2. Implementation Details

To enable stable training, the following two-part solution was implemented:

Feature Clipping: A `torch.clamp` call was added to the `encode_images` method in `llava/model/llava_arch.py`. This controls the extreme magnitude of the Gemma features by ensuring they are within a stable `[-10, 10]` range before being passed to the language model.

Manual Loss Calculation: The `compute_loss` method in `llava/train/llava_trainer.py` was overridden to bypass the model's unstable internal loss function. This implementation takes the clean `logits` from the model and performs a stable, manual `CrossEntropyLoss` calculation.
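A minimal sketch of such a manual calculation for a causal LM (the function name and the fp32 upcast are illustrative assumptions; the actual override lives in the PR's `llava/train/llava_trainer.py`):

```python
import torch
import torch.nn.functional as F


def manual_shifted_ce_loss(
    logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100
) -> torch.Tensor:
    """Manual causal-LM cross-entropy: position t predicts token t+1.

    Shapes: logits (batch, seq, vocab), labels (batch, seq).
    Upcasting to fp32 before the loss is a common stability measure;
    whether the PR does this exactly is an assumption.
    """
    # Drop the last logit and the first label so predictions align
    # with the next token at each position.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)).float(),
        shift_labels.view(-1),
        ignore_index=ignore_index,  # masked positions don't contribute
    )
```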