Update to llama.cpp 2026-01-01 #2108
Conversation
Force-pushed from 502532a to 23c10e8
Tested on macOS using CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages
This will need at least one more (very important) change, as the layout of mtmd_context_params has changed upstream; the fields should now be:

class mtmd_context_params(Structure):
    _fields_ = [
        ("use_gpu", c_bool),
        ("print_timings", c_bool),
        ("n_threads", c_int),
        ("image_marker", c_char_p),
        ("media_marker", c_char_p),
        ("flash_attn_type", c_int),
        ("warmup", c_bool),
        ("image_min_tokens", c_int),
        ("image_max_tokens", c_int),
    ]
More changes needed as the layout of

Also the
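As an aside (not from the review itself), a tiny self-contained sketch of why the field order above has to match mtmd.h exactly. The structs here are hypothetical stand-ins, not the real bindings: reading the same bytes through an outdated layout silently yields garbage instead of raising an error.

```python
import ctypes
from ctypes import Structure, c_bool, c_int

class ParamsCurrent(Structure):
    # field order as in the (hypothetical) up-to-date header
    _fields_ = [("use_gpu", c_bool), ("print_timings", c_bool), ("n_threads", c_int)]

class ParamsStale(Structure):
    # same fields, wrong order -- stands in for an outdated binding
    _fields_ = [("n_threads", c_int), ("use_gpu", c_bool), ("print_timings", c_bool)]

p = ParamsCurrent(use_gpu=True, print_timings=False, n_threads=8)
stale = ctypes.cast(ctypes.pointer(p), ctypes.POINTER(ParamsStale)).contents
print(stale.n_threads)  # not 8: the same bytes read through the wrong offsets
```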
Force-pushed from 23c10e8 to a070f61
@ctypes_function("llama_max_tensor_buft_overrides", [], ctypes.c_size_t)
def llama_max_tensor_buft_overrides() -> int:
    """Get maximum number of tensor buffer type overrides"""
    ...
Stray ellipsis operator (which does nothing, but still)
Sorry! The issue here isn't the ellipsis operator, it's that at some point there were two of them - you shouldn't change this to pass because that implies that the function returns None, which will cause type checking to fail.
thank you, and I understand now. done
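For readers following along, a small hedged illustration of the stub-style point above. The function names are hypothetical; in the real bindings the @ctypes_function decorator wires the stub to the C symbol, so the Python body is only a typed placeholder.

```python
# Hypothetical stubs (not the PR's code) illustrating the exchange above.

def stub_with_ellipsis() -> int:
    """Conventional placeholder body for a typed stub whose real behavior
    is supplied elsewhere (e.g. by a decorator)."""
    ...

def stub_with_pass() -> int:
    """The reviewer's point: `pass` implies an implicit `return None`,
    which type checkers reject against the `-> int` annotation."""
    pass
```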
Force-pushed from 5042296 to d14a24f
@dhdaines thanks for the review, I need some time to incorporate your comments; setting to draft in the meantime.
Force-pushed from 64b087c to 3ffec02
I think I have fixed this, could you check?
Force-pushed from 6dbddac to 39a2ee8
Yes, this looks to me like a good way to handle it! We can see what the maintainer @abetlen thinks though...
- Update vendor/llama.cpp submodule to be47fb92 (2026-01-01)
- Bump version from 0.3.16 to 0.4.0

Breaking changes:
- Migrate flash_attn bool to flash_attn_type enum (backward compatible via None=AUTO)
- Replace llama_kv_self_* API with llama_memory_* API

New features:
- Add LLAMA_FLASH_ATTN_TYPE_* enum (AUTO/DISABLED/ENABLED)
- Add llama_model_params fields: no_host, no_alloc
- Add mtmd_context_params fields: flash_attn_type, warmup, image_min/max_tokens
- Add LLAMA_ROPE_TYPE_IMROPE, LLAMA_PARAMS_FIT_STATUS_* enums
- Add 15+ new functions: llama_max_tensor_buft_overrides, llama_n_ctx_seq, llama_model_n_embd_inp, llama_model_is_hybrid, llama_log_*, llama_memory_*, llama_attach/detach_threadpool, llama_adapter_meta_* (4 functions)

Fixes:
- Server settings: flash_attn default None (AUTO) instead of False (DISABLED)
- Enable FIM token functions: token_prefix/middle/suffix
- Fix typos: additonal→additional, unnused→unused
- Remove deprecated verbosity field from mtmd_context_params
- Add CMake version workaround documentation

Code quality:
- Consistent stub style (... not pass)
- Struct alignment verified against llama.h and mtmd.h
- Minimal whitespace noise (0.4% of diff)
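As a hedged sketch (not code from this PR) of how the None=AUTO default described above can keep the old boolean setting working, assuming the three LLAMA_FLASH_ATTN_TYPE_* values listed in the summary:

```python
# Hedged sketch: mapping the legacy flash_attn bool (or None) onto the new enum.
LLAMA_FLASH_ATTN_TYPE_AUTO = -1      # assumed values; check llama.h for the real ones
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

def resolve_flash_attn_type(flash_attn: bool | None) -> int:
    """Map the legacy bool (or None) to the new flash_attn_type enum."""
    if flash_attn is None:           # new default: let llama.cpp decide
        return LLAMA_FLASH_ATTN_TYPE_AUTO
    return LLAMA_FLASH_ATTN_TYPE_ENABLED if flash_attn else LLAMA_FLASH_ATTN_TYPE_DISABLED
```

This mirrors the server-settings fix above, where the default moves from False (DISABLED) to None (AUTO).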
Force-pushed from 39a2ee8 to 103f671
My intention was to sweep in like a hero and save the day. Didn't work as planned :/ I've rewritten the PR; it's cleaner, with much less whitespace noise. All review comments are incorporated.
Thank you so much avion23 for your efforts to update the Python bindings to a recent llama-cpp version! I'm trying to use them in a Jupyter notebook (in Docker) on an Nvidia 5090 GPU. Although the latest locally built llama-cli is running in that same environment (see attached llama-cli.txt) and the problems discussed above are gone, the freshly built bindings produce a kernel crash when loading models to the GPU (after loading weights to the GPU, maybe a context issue, see attached build.txt). I'm pretty sure it could be my mistake when installing your branch for GPU support. Any ideas what I did wrong?!

Edit (New Findings): The above GPU build works with n_gpu_layers=0 (CPU only). This narrows the problem down to context handling in the GPU code path.
Bindings were 5 months outdated, preventing newer model architectures from loading.
Updates bindings to llama.cpp commit be47fb92 (2026-01-01).
Removed
- llama_kv_self_* functions (use llama_memory_* API)
- llama_sampler_init_softmax()

Added
Enums:
- LLAMA_ROPE_TYPE_IMROPE
- llama_flash_attn_type
- llama_params_fit_status
- llama_model_meta_key

Struct fields:
- llama_model_params: no_host, no_alloc
- llama_context_params: flash_attn_type (replaced flash_attn bool)

Functions:
- llama_max_tensor_buft_overrides, llama_n_ctx_seq, llama_model_n_embd_inp, llama_model_is_hybrid, llama_flash_attn_type_name, llama_model_meta_key_str, llama_adapter_meta_* (5 functions), llama_log_get, llama_log_set, llama_memory_breakdown_print

Breaking Changes
flash_attn parameter:
- flash_attn bool → flash_attn_type enum; passing None selects AUTO, so existing bool usage stays backward compatible

KV cache API:
- llama_kv_self_* functions are removed; use the llama_memory_* API instead
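A hedged migration sketch for the two breaking changes above, written against the low-level bindings. The names mirror llama.h and the function list in this PR; exact Python-level signatures may differ slightly, and "model.gguf" is a placeholder path.

```python
import llama_cpp

llama_cpp.llama_backend_init()
model = llama_cpp.llama_model_load_from_file(
    b"model.gguf", llama_cpp.llama_model_default_params()
)

# flash_attn: the old bool field is gone; set flash_attn_type on the context params.
cparams = llama_cpp.llama_context_default_params()
cparams.flash_attn_type = llama_cpp.LLAMA_FLASH_ATTN_TYPE_AUTO  # was: cparams.flash_attn = True/False
ctx = llama_cpp.llama_init_from_model(model, cparams)

# KV cache: llama_kv_self_clear(ctx) is removed; go through the memory object.
mem = llama_cpp.llama_get_memory(ctx)    # assumed to be bound 1:1 from llama.h
llama_cpp.llama_memory_clear(mem, True)  # True also clears the data buffers
```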
Other
- ggml_log_callback typedef
- CMake version workaround documentation (LLAMA_INSTALL_VERSION before subdirectory include)

Tested: macOS ARM64 Metal, Python 3.14, Nemotron-3-Nano-30B