Conversation

@avion23 avion23 commented Jan 1, 2026

The bindings were five months out of date, preventing newer model architectures from loading.

Updates bindings to llama.cpp commit be47fb92 (2026-01-01).

Removed

  • 14 llama_kv_self_* functions (use the llama_memory_* API instead)
  • llama_sampler_init_softmax()

Added

Enums:

  • LLAMA_ROPE_TYPE_IMROPE
  • llama_flash_attn_type
  • llama_params_fit_status
  • llama_model_meta_key

Struct fields:

  • llama_model_params: no_host, no_alloc
  • llama_context_params: flash_attn_type (replaced flash_attn bool)

Functions:
llama_max_tensor_buft_overrides, llama_n_ctx_seq, llama_model_n_embd_inp, llama_model_is_hybrid, llama_flash_attn_type_name, llama_model_meta_key_str, llama_adapter_meta_* (5 functions), llama_log_get, llama_log_set, llama_memory_breakdown_print
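
As an illustration of the newly bound logging hooks (llama_log_set above, plus the ggml_log_callback typedef noted under "Other"), here is a minimal sketch of installing a Python log callback. The exported typedef name and the exact Python-side signature are assumptions based on this PR's notes and the upstream C API; verify both against llama_cpp/llama_cpp.py.

import ctypes
import llama_cpp

# Assumption: ggml_log_callback is exported as a ctypes CFUNCTYPE matching the C
# typedef void (*)(enum ggml_log_level level, const char * text, void * user_data).
@llama_cpp.ggml_log_callback
def _on_log(level, text, user_data):
    # text arrives as bytes from C; llama.cpp supplies its own newlines
    print(f"[llama:{level}] {text.decode('utf-8', errors='replace')}", end="")

# Keep _on_log referenced for the lifetime of the process so the ctypes callback
# object is not garbage collected while llama.cpp still holds a pointer to it.
llama_cpp.llama_log_set(_on_log, ctypes.c_void_p(0))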

Breaking Changes

flash_attn parameter:

# Old
params.flash_attn = True
# New
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_ENABLED
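
For reference, the new enum has three modes; the value names are the ones referenced elsewhere in this PR, while the integer values below are assumptions taken from recent llama.h and should be verified against the vendored header. AUTO is the new default and lets llama.cpp decide per model:

# Values assumed from recent llama.h -- verify against vendor/llama.cpp/include/llama.h.
LLAMA_FLASH_ATTN_TYPE_AUTO     = -1  # let llama.cpp decide (the new default)
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0   # equivalent of the old flash_attn = False
LLAMA_FLASH_ATTN_TYPE_ENABLED  = 1   # equivalent of the old flash_attn = True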

KV cache API:

# Old
llama_kv_self_clear(ctx)
# New
llama_memory_clear(mem, data=True)
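
Here mem is not the context itself but the llama_memory_t handle fetched from it. A minimal sketch of a migrated call site, assuming the bindings expose llama_get_memory as in upstream llama.h:

mem = llama_get_memory(ctx)         # llama_memory_t handle owned by the context
llama_memory_clear(mem, data=True)  # data=True clears the KV buffers, not just the metadata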

Other

  • Added ggml_log_callback typedef
  • Fixed LLAVA/mtmd build (set LLAMA_INSTALL_VERSION before subdirectory include)
  • Version 0.3.16 → 0.4.0

Tested: macOS ARM64 Metal, Python 3.14, Nemotron-3-Nano-30B

@avion23 avion23 marked this pull request as draft January 1, 2026 19:40
@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from 502532a to 23c10e8 on January 1, 2026 19:50
@avion23 avion23 marked this pull request as ready for review January 1, 2026 19:52

avion23 commented Jan 1, 2026

Tested on macOS using CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages

dhdaines commented Jan 4, 2026

This will need at least one more (very important) change, as the layout of mtmd_context_params has changed. It should be updated in mtmd_cpp.py to this:

class mtmd_context_params(Structure):
    _fields_ = [
        ("use_gpu", c_bool),
        ("print_timings", c_bool),
        ("n_threads", c_int),
        ("image_marker", c_char_p),
        ("media_marker", c_char_p),
        ("flash_attn_type", c_int),
        ("warmup", c_bool),
        ("image_min_tokens", c_int),
        ("image_max_tokens", c_int),
    ]
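
Because ctypes silently mis-reads fields when the Python layout drifts from the C struct, a quick sanity check while reviewing is to dump the field offsets and sizes and compare them by hand against struct mtmd_context_params in mtmd.h. A minimal sketch, assuming the mtmd_context_params class above is in scope:

import ctypes

# Print name / offset / size for every field so the layout can be compared
# against the field order and types declared in mtmd.h.
for name, _ctype in mtmd_context_params._fields_:
    descriptor = getattr(mtmd_context_params, name)
    print(f"{name:<16} offset={descriptor.offset:<4} size={descriptor.size}")
print("sizeof(mtmd_context_params) =", ctypes.sizeof(mtmd_context_params))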

dhdaines commented Jan 4, 2026

More changes are needed, as the layout of llama_context_params has also changed: a new field, flash_attn_type, has been added after attention_type.

dhdaines commented Jan 4, 2026

Also the flash_attn parameter no longer exists, has been replaced by flash_attn_type... the default is now to determine automatically whether to use it (as some models require it). This is unfortunately a breaking change, not sure if you want to preserve the flash_attn parameter in the higher-level Python API.
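
One way to keep the old keyword working in the high-level API is to accept an Optional[bool] and translate it, with None meaning AUTO (the commit message later in this thread confirms the PR went with a None=AUTO mapping). The helper below is only an illustrative sketch, not the PR's actual code, and assumes the enum constants are re-exported at package level:

from typing import Optional

from llama_cpp import (  # assumption: constants are re-exported at package level
    LLAMA_FLASH_ATTN_TYPE_AUTO,
    LLAMA_FLASH_ATTN_TYPE_DISABLED,
    LLAMA_FLASH_ATTN_TYPE_ENABLED,
)

# Illustrative only: map the legacy flash_attn keyword onto the new enum,
# defaulting to "let llama.cpp decide" when the caller did not specify it.
def resolve_flash_attn_type(flash_attn: Optional[bool]) -> int:
    if flash_attn is None:
        return LLAMA_FLASH_ATTN_TYPE_AUTO
    return LLAMA_FLASH_ATTN_TYPE_ENABLED if flash_attn else LLAMA_FLASH_ATTN_TYPE_DISABLED

# e.g. in a high-level constructor:
# context_params.flash_attn_type = resolve_flash_attn_type(flash_attn)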

@ctypes_function("llama_max_tensor_buft_overrides", [], ctypes.c_size_t)
def llama_max_tensor_buft_overrides() -> int:
    """Get maximum number of tensor buffer type overrides"""
    ...

Stray ellipsis operator (which does nothing, but still)

Sorry! The issue here isn't the ellipsis operator, it's that at some point there were two of them - you shouldn't change this to pass because that implies that the function returns None, which will cause type checking to fail.

Thank you, and I understand now. Done.

@avion23 avion23 marked this pull request as draft January 4, 2026 13:41
@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from 5042296 to d14a24f on January 4, 2026 13:42

avion23 commented Jan 4, 2026

@dhdaines thanks for the review; I need some time to incorporate your comments, so I'm setting this to draft in the meantime.

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch 3 times, most recently from 64b087c to 3ffec02 on January 5, 2026 10:18

avion23 commented Jan 5, 2026

Also the flash_attn parameter no longer exists, has been replaced by flash_attn_type... the default is now to determine automatically whether to use it (as some models require it). This is unfortunately a breaking change, not sure if you want to preserve the flash_attn parameter in the higher-level Python API.

I think I have fixed this, could you check?

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch 2 times, most recently from 6dbddac to 39a2ee8 on January 5, 2026 14:35

dhdaines commented Jan 5, 2026

Also the flash_attn parameter no longer exists, has been replaced by flash_attn_type... the default is now to determine automatically whether to use it (as some models require it). This is unfortunately a breaking change, not sure if you want to preserve the flash_attn parameter in the higher-level Python API.

I think I have fixed this, could you check?

Yes, this looks to me like a good way to handle it! We can see what the maintainer @abetlen thinks though...

- Update vendor/llama.cpp submodule to be47fb92 (2026-01-01)
- Bump version from 0.3.16 to 0.4.0

Breaking changes:
- Migrate flash_attn bool to flash_attn_type enum (backward compatible via None=AUTO)
- Replace llama_kv_self_* API with llama_memory_* API

New features:
- Add LLAMA_FLASH_ATTN_TYPE_* enum (AUTO/DISABLED/ENABLED)
- Add llama_model_params fields: no_host, no_alloc
- Add mtmd_context_params fields: flash_attn_type, warmup, image_min/max_tokens
- Add LLAMA_ROPE_TYPE_IMROPE, LLAMA_PARAMS_FIT_STATUS_* enums
- Add 15+ new functions: llama_max_tensor_buft_overrides, llama_n_ctx_seq,
  llama_model_n_embd_inp, llama_model_is_hybrid, llama_log_*, llama_memory_*,
  llama_attach/detach_threadpool, llama_adapter_meta_* (4 functions)

Fixes:
- Server settings: flash_attn default None (AUTO) instead of False (DISABLED)
- Enable FIM token functions: token_prefix/middle/suffix
- Fix typos: additonal→additional, unnused→unused
- Remove deprecated verbosity field from mtmd_context_params
- Add CMake version workaround documentation

Code quality:
- Consistent stub style (... not pass)
- Struct alignment verified against llama.h and mtmd.h
- Minimal whitespace noise (0.4% of diff)
@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from 39a2ee8 to 103f671 on January 6, 2026 19:17
@avion23 avion23 marked this pull request as ready for review January 6, 2026 19:22

avion23 commented Jan 6, 2026

My intention was to sweep in like a hero and save the day. Didn't work as planned :/

I've rewritten the PR; it's cleaner, with much less whitespace noise. All review comments are incorporated.

oss-roettger commented Jan 8, 2026

Thank you so much, avion23, for your efforts to update the Python bindings to a recent llama.cpp version!

I'm trying to use them in a Jupyter notebook (in Docker) on an Nvidia 5090 GPU. Although the latest locally built llama-cli runs in that same environment (see attached llama-cli.txt) and the problems discussed above are gone, the freshly built bindings produce a kernel crash when loading models to the GPU (after loading weights to the GPU, maybe a context issue, see attached build.txt).

I'm pretty sure it could be my mistake when installing your branch for GPU support:
!CMAKE_ARGS="-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86";pip install --force-reinstall --upgrade git+https://github.com/avion23/llama-cpp-python@update-llama-cpp-2026-01

Any ideas what I did wrong?!

Edit (new findings): The above GPU build works with n_gpu_layers=0 (CPU only). This narrows the problem down to context handling in the GPU code path.
Edit 2: Very, very strange: after switching back to n_gpu_layers=100 (from n_gpu_layers=0) I was able to load and successfully run the new Nemotron-Nano-3-30B-A3B-Q4_K_M.gguf and Ling-mini-2.0.Q4_K_M.gguf models on the GPU (on the same build that had always(!) crashed the kernel while loading the models before). Could it be that there is some context initialization code which runs only in CPU mode but is also important for GPU mode?!
