mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter) #16574
base: master
Conversation
ngxson left a comment
add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks;
I don't see why we need to add this new CLI. The mtmd-cli can already do this with the -p and --image params.
convert_hf_to_gguf.py (Outdated)
# Top-level direct mappings
if src_no_vm == 'cls_token':
    return [('v.cls_token', data_torch)]
Use proper mapping instead
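For illustration, a minimal sketch of the table-driven mapping the comment is pointing at, assuming a cls_token entry is registered in gguf-py/gguf/tensor_mapping.py; the enum name in the comments below is hypothetical, and the point is only to route the tensor through the generic map_tensor_name() path instead of returning a hard-coded string:

# Sketch only, not the PR's code: resolve the GGUF name via the shared tensor map.
# Assumes an entry like MODEL_TENSOR.V_ENC_EMBD_CLS (hypothetical) -> 'v.cls_token'
# exists in gguf-py/gguf/tensor_mapping.py with 'cls_token' among its HF aliases.
def modify_tensors(self, data_torch, name, bid):
    del bid  # not used for this tensor
    # map_tensor_name() raises if the HF name has no mapping, so a missing entry
    # surfaces at conversion time instead of silently producing a wrong key
    return [(self.map_tensor_name(name), data_torch)]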
tools/mtmd/clip.cpp (Outdated)
if (!ctx->jinaclip_rope_initialized) {
    const int half_dim = rope_dim / 2;
    std::vector<float> base_freqs(half_dim);
    for (int i = 0; i < half_dim; i++) {
        float arange_val    = i * 2.0f;                     // [0, 2, 4, ..., 30]
        float normalized    = arange_val / rope_dim;        // [0, 2/32, 4/32, ..., 30/32]
        float theta_powered = powf(freq_base, normalized);  // theta^normalized
        base_freqs[i] = 1.0f / theta_powered;               // 1.0 / theta^normalized
    }
Not sure what you're trying to do here. Is this just 2D RoPE (which we already support)?
This isn’t re‑implementing generic 2D RoPE; it implements JinaCLIP’s VisionRotaryEmbeddingFast.
It uses fractional‑position 2D RoPE (t = arange(ft)/ft * pt) and precomputes a full H×W cos/sin grid; the official 2D RoPE uses integer grid positions (pos_h/pos_w) with ggml_rope_ext and does not include these steps.
This is done to strictly match Jina’s Python semantics.
fractional‑position 2D RoPE (t = arange(ft)/ft * pt)
Based on your code:
time_seq[i] = (float) i / ft_seq_len * pt_seq_len; // [0, 16/36, 32/36, ..., 560/36]
...
freqs_h[t * half_dim + f] = time_seq[t] * base_freqs[f];
Then why don't we scale base_freqs[f] instead? The third param of ggml_rope_ext, the c tensor (freq_scale), is made for this purpose.
Honestly I think this is just YaRN
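To make this concrete, a small self-contained numpy sketch; the values (rope_dim = 32, ft_seq_len = 36, pt_seq_len = 16, freq_base = 10000) are assumptions taken from the inline comments above. It shows that the fractional-position grid and plain integer positions with the frequencies pre-scaled by pt/ft give identical angles, which is the effect the freq_scale argument of ggml_rope_ext applies:

import numpy as np

rope_dim, ft_seq_len, pt_seq_len, freq_base = 32, 36, 16, 10000.0
half_dim   = rope_dim // 2
base_freqs = 1.0 / freq_base ** (np.arange(0, rope_dim, 2) / rope_dim)  # 1 / theta^(2i/d)

# variant 1: fractional positions, unscaled frequencies (as in the snippet above)
t_frac = np.arange(ft_seq_len) / ft_seq_len * pt_seq_len                # [0, 16/36, 32/36, ...]
freqs_fractional = np.outer(t_frac, base_freqs)

# variant 2: integer positions, frequencies pre-scaled by freq_scale = pt/ft
freq_scale   = pt_seq_len / ft_seq_len
freqs_scaled = np.outer(np.arange(ft_seq_len), base_freqs * freq_scale)

# identical angles, hence identical cos/sin tables
assert np.allclose(freqs_fractional, freqs_scaled)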
@pockers21 What's up?

I'm currently adjusting the code and fixing issues. I originally planned to answer your questions together when …
…icubic; switch to 'jinaclip2'; fix converter constants
Remove unnecessary try/except Jina text hparams. Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Update Notes (2025-11-06)
- GGUF hparams: block_count / projection_dim / feed_forward_length / attention.head_count.
Reproduction

Minimal commands & data (CPU)
- jina-bert-v3.pooling_type = MEAN/CLS/LAST
- clip.projector_type = jinaclip2, clip.vision.rope_theta = 10000 (default)
- Text embedding (llama.cpp):
  CUDA_VISIBLE_DEVICES= ./build/bin/llama-embedding -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0 --pooling mean --embd-normalize 2 --embd-output-format array
- Text reference (Python):
  python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
- Image embedding (llama.cpp):
  CUDA_VISIBLE_DEVICES= ./build/bin/llama-mtmd-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0 --embd-normalize 2 --embd-output-format array
- Image reference (Python):
  python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
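A minimal parity-check sketch for the embeddings produced by the commands above; the file names and formats are assumptions and depend on how the two outputs were saved:

import numpy as np

# hypothetical dump files: one from the llama.cpp side (--embd-output-format array),
# one from the Python reference (<ref>/debug.py)
llamacpp_embd  = np.loadtxt("embd_llamacpp.txt", dtype=np.float32).ravel()
reference_embd = np.load("embd_reference.npy").astype(np.float32).ravel()

# mirror --embd-normalize 2 (L2 normalization) on both sides before comparing
llamacpp_embd  /= np.linalg.norm(llamacpp_embd)
reference_embd /= np.linalg.norm(reference_embd)

print("cosine similarity:", float(np.dot(llamacpp_embd, reference_embd)))
print("max abs diff     :", float(np.max(np.abs(llamacpp_embd - reference_embd))))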
mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)

Overview
- Embeddings are normalized via common_embd_normalize(..., 2).
- Add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks; depends only on common + mtmd + Threads, cross-platform buildable, no third-party deps.

Scope of changes
- Converter writes clip.projector_type = jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and maps tensors for fused/non-fused QKV.
- clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
- Add the llama-jinaclip-cli target (built by default); one command covers text/image minimal validation, thread scaling, and encode_ms reporting, and saves embeddings for Python parity.

Validation summary
- ci/run.sh passes locally; no ggml op changes in this PR.
- encode_ms and thread scaling show no regression; more data can be added if requested.

Performance (absolute metrics, CPU-only minimal samples)
GPU group (absolute metrics, minimal samples)