From 5ad0eccd97170c38d96405dbb4e6240748854aba Mon Sep 17 00:00:00 2001
From: Yamini Nimmagadda
Date: Mon, 12 Jan 2026 16:48:59 -0800
Subject: [PATCH 1/6] Create OPENVINO.md in llama.cpp backend docs

---
 docs/backend/OPENVINO.md | 144 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 docs/backend/OPENVINO.md

diff --git a/docs/backend/OPENVINO.md b/docs/backend/OPENVINO.md
new file mode 100644
index 00000000000..d56c61d8a8b
--- /dev/null
+++ b/docs/backend/OPENVINO.md
@@ -0,0 +1,144 @@
+# OpenVINO Backend for llama.cpp
+
+This document describes the OpenVINO backend for `llama.cpp`, which enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**.
+
+The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.
+
+## Overview
+
+The OpenVINO backend is implemented in `ggml/src/ggml-openvino` and provides a translation layer for core GGML operations. It supports FP16 and BF16 models, as well as selected quantized GGUF formats. This backend enables accelerated inference on Intel CPUs, integrated and discrete GPUs, and NPUs, while integrating seamlessly with the existing `llama.cpp` execution flow.
+
+## Supported Devices
+
+The OpenVINO backend supports the following hardware:
+
+- Intel CPUs
+- Intel integrated GPUs
+- Intel NPUs (requires the UD32+ driver)
+
+Although OpenVINO supports a wide range of [Intel hardware](https://docs.openvino.ai/2025/about-openvino/release-notes-openvino/system-requirements.html), the llama.cpp OpenVINO backend has been validated specifically on AI PCs such as the Intel® Core™ Ultra Series 1 and Series 2.
+
+## Supported Model Precisions
+
+### Fully Supported
+
+- FP16 GGUF
+- BF16 GGUF
+
+### Quantized Models (Partial Support)
+
+- `Q4_0`
+- `Q4_1`
+- `Q4_K_M`
+- `Q6_K`
+
+Accuracy and performance optimizations for quantized models are still work in progress.
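+
+As a quick illustration, a GGUF file in one of these partially supported quantized formats can be produced from an FP16 GGUF with the `llama-quantize` tool. This is a minimal sketch only; the file names are placeholders, and the binary path assumes the build directory used in the examples below.
+
+```bash
+# Convert an FP16 GGUF to Q4_0, one of the partially supported quantized formats.
+# File names are illustrative; substitute your own model paths.
+./build/ReleaseOV/bin/llama-quantize model-f16.gguf model-q4_0.gguf Q4_0
+```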
+
+## Quantization Support Details
+
+### CPU
+
+- **`Q4_0`, `Q4_1`, `Q4_K_M`, `Q6_K` models are supported**
+- `Q6_K` tensors (6-bit, gs16 symmetric) are converted to int8 gs16 symmetric
+- `Q5_K` tensors (5-bit, gs32 asymmetric) are converted to int8 gs32 asymmetric
+
+### GPU
+
+- **`Q4_0`, `Q4_1`, `Q4_K_M`, `Q6_K` models are supported**
+- `Q6_K` tensors (6-bit, gs16 symmetric) are requantized to int8 gs32 symmetric
+- `Q5_K` tensors (5-bit, gs32 asymmetric) are converted to int8 gs32 asymmetric
+
+### NPU
+
+- **Primary supported quantization scheme is `Q4_0`**
+- `Q4_0` and `Q4_1` tensors are requantized to int4 gs128 symmetric
+- `Q6_K` tensors are dequantized to FP16
+
+#### Additional Notes
+
+- Both `Q4_0` and `Q4_1` models use `Q6_K` for the token embedding tensor and the final matmul weight tensor (often the same tensor)
+- `Q4_0` models may produce some `Q4_1` tensors if an imatrix is provided during quantization using `llama-quantize`
+- `Q4_K_M` models may include both `Q6_K` and `Q5_K` tensors (observed in Phi-3)
+
+## Validated Models
+
+The following models have been validated for functionality on Intel® Core™ Ultra Series 1 and Series 2:
+
+- [Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF)
+- [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
+- [microsoft/Phi-3-mini-4k-instruct-gguf](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)
+- [Qwen/Qwen2.5-1.5B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF)
+- [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+- [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)
+- [tencent/Hunyuan-7B-Instruct](https://huggingface.co/tencent/Hunyuan-7B-Instruct)
+- [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
+
+## Build Instructions
+
+### Prerequisites
+
+- OpenVINO runtime and development packages
+- CMake
+- C++17-compatible compiler
+
+### Build Example
+
+```bash
+cmake -B build/ReleaseOV \
+  -DGGML_OPENVINO=ON \
+  -DCMAKE_BUILD_TYPE=Release
+
+cmake --build build/ReleaseOV -j
+```
+
+# Runtime Configuration
+
+The OpenVINO backend can be configured using the following environment variables at runtime to control device selection, caching, debugging, and profiling behavior.
+
+## Configuration Options
+
+| Variable | Description |
+|--------|-------------|
+| `GGML_OPENVINO_DEVICE` | Specify the target device (`CPU`, `GPU`, `NPU`). If not set, the backend automatically selects the first available device in priority order: **GPU → CPU → NPU**. When set to `NPU`, static compilation mode is enabled for optimal performance. |
+| `GGML_OPENVINO_CACHE_DIR` | Directory for OpenVINO model caching (recommended: `/tmp/ov_cache`). Enables model caching when set. **Not supported on NPU devices.** |
+| `GGML_OPENVINO_PROFILING` | Enable execution-time profiling. |
+| `GGML_OPENVINO_DUMP_CGRAPH` | Dump the GGML compute graph to `cgraph.txt`. |
+| `GGML_OPENVINO_DUMP_IR` | Export OpenVINO IR files with timestamps. |
+| `GGML_OPENVINO_DEBUG_INPUT` | Enable input debugging. |
+| `GGML_OPENVINO_DEBUG_OUTPUT` | Enable output debugging. |
+
+## Example Usage
+
+### GPU Inference with Profiling
+
+```bash
+export GGML_OPENVINO_CACHE_DIR=/tmp/ov_cache
+export GGML_OPENVINO_PROFILING=1
+export GGML_OPENVINO_DEVICE=GPU
+
+./build/ReleaseOV/bin/llama-simple \
+    -m ~/models/Llama-3.2-1B-Instruct.fp16.gguf \
+    -n 50 \
+    "The story of AI is "
+```
+
+### llama-bench
+
+```bash
+GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1
+```
+`-fa 1` is required when running `llama-bench` with the OpenVINO backend.
+
+### NPU Notes
+
+- Prompt processing is currently slower than CPU/GPU
+- Smaller context sizes are recommended (e.g. `-c 512`)
+- Static compilation mode is enabled automatically
+- Model caching is not yet supported
+
+## Work in Progress
+
+- Performance and memory optimizations
+- Broader quantization coverage
+- Support for additional model architectures
+- Extensive accuracy validation

From 97623673e3b322a66ea72ea8f850a3517bd3a071 Mon Sep 17 00:00:00 2001
From: Yamini Nimmagadda
Date: Mon, 12 Jan 2026 17:12:01 -0800
Subject: [PATCH 2/6] Update OPENVINO.md

---
 docs/backend/OPENVINO.md | 38 +++++++++++++++-----------------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/docs/backend/OPENVINO.md b/docs/backend/OPENVINO.md
index d56c61d8a8b..bc3a2c66cd1 100644
--- a/docs/backend/OPENVINO.md
+++ b/docs/backend/OPENVINO.md
@@ -52,7 +52,7 @@ Accuracy and performance optimizations for quantized models are still work in pr
 
 - **Primary supported quantization scheme is `Q4_0`**
 - `Q4_0` and `Q4_1` tensors are requantized to int4 gs128 symmetric
-- `Q6_K` tensors are dequantized to FP16
+- `Q6_K` tensors are requentized to int8 except for the token embedding matrix
 
 #### Additional Notes
 
@@ -72,30 +72,17 @@ The following models have been validated for functionality on Intel® Core™ Ul
 - [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)
 - [tencent/Hunyuan-7B-Instruct](https://huggingface.co/tencent/Hunyuan-7B-Instruct)
 - [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
+- [bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF)
 
 ## Build Instructions
 
-### Prerequisites
+For detailed build instructions, refer to [build.md](../build.md#openvino).
 
-- OpenVINO runtime and development packages
-- CMake
-- C++17-compatible compiler
-
-### Build Example
-
-```bash
-cmake -B build/ReleaseOV \
-  -DGGML_OPENVINO=ON \
-  -DCMAKE_BUILD_TYPE=Release
-
-cmake --build build/ReleaseOV -j
-```
-
-# Runtime Configuration
+## Runtime Configuration
 
 The OpenVINO backend can be configured using the following environment variables at runtime to control device selection, caching, debugging, and profiling behavior.
 
-## Configuration Options
+### Configuration Options
 
 | Variable | Description |
 |--------|-------------|
@@ -107,9 +94,9 @@ The OpenVINO backend can be configured using the following environment variables
 | `GGML_OPENVINO_DEBUG_INPUT` | Enable input debugging. |
 | `GGML_OPENVINO_DEBUG_OUTPUT` | Enable output debugging. |
 
-## Example Usage
+### Example Usage
 
-### GPU Inference with Profiling
+#### GPU Inference with Profiling
 
 ```bash
 export GGML_OPENVINO_CACHE_DIR=/tmp/ov_cache
@@ -122,7 +109,7 @@ export GGML_OPENVINO_DEVICE=GPU
     "The story of AI is "
 ```
 
-### llama-bench
+#### llama-bench
 
 ```bash
 GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1
@@ -131,11 +118,16 @@ GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1
 
 ### NPU Notes
 
-- Prompt processing is currently slower than CPU/GPU
 - Smaller context sizes are recommended (e.g. `-c 512`)
 - Static compilation mode is enabled automatically
 - Model caching is not yet supported
-
+- Does not support `llama-server` with `-np` > 1 (multiple parallel sequences)
+- Only supports `llama-perplexity` with `-b` 512 or smaller
+
+## Llama.cpp Tools
+
+The following tools work with the OpenVINO backend on CPU and GPU: `llama-simple`, `llama-run`, `llama-cli`, `llama-server`, `llama-bench`, and `llama-perplexity`.
+
 ## Work in Progress

From f8f194626d04d1b53e7f6bb8808e8e70702806bd Mon Sep 17 00:00:00 2001
From: Yamini Nimmagadda
Date: Mon, 12 Jan 2026 17:29:46 -0800
Subject: [PATCH 3/6] Update OPENVINO.md

---
 docs/backend/OPENVINO.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/backend/OPENVINO.md b/docs/backend/OPENVINO.md
index bc3a2c66cd1..7c2e733b03c 100644
--- a/docs/backend/OPENVINO.md
+++ b/docs/backend/OPENVINO.md
@@ -93,6 +93,9 @@ The OpenVINO backend can be configured using the following environment variables
 | `GGML_OPENVINO_DUMP_IR` | Export OpenVINO IR files with timestamps. |
 | `GGML_OPENVINO_DEBUG_INPUT` | Enable input debugging. |
 | `GGML_OPENVINO_DEBUG_OUTPUT` | Enable output debugging. |
+| *`GGML_OPENVINO_STATEFUL_EXECUTION` | Enable stateful execution for better performance. |
+
+*`GGML_OPENVINO_STATEFUL_EXECUTION` is an **experimental** feature that enables stateful execution, managing the KV cache internally inside the OpenVINO model and improving performance on CPUs and GPUs. Stateful execution is not effective on NPUs, and not all models currently support it. The feature has been validated only with the llama-simple, llama-cli, llama-bench, and llama-run applications, where enabling it is recommended for best performance. Other applications, such as llama-server and llama-perplexity, are not yet supported.
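+
+As a minimal sketch (the model path and prompt are placeholders), the experimental stateful execution mode can be combined with the other options above, for example with `llama-cli` on GPU:
+
+```bash
+export GGML_OPENVINO_DEVICE=GPU
+export GGML_OPENVINO_STATEFUL_EXECUTION=1
+
+./build/ReleaseOV/bin/llama-cli \
+    -m ~/models/Llama-3.2-1B-Instruct.fp16.gguf \
+    -p "The story of AI is " \
+    -n 50
+```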
 
 ### Example Usage

From e20235a04dc8bf828343fcb6ad6576b82f84501e Mon Sep 17 00:00:00 2001
From: Yamini Nimmagadda
Date: Mon, 12 Jan 2026 17:37:26 -0800
Subject: [PATCH 4/6] Update OPENVINO.md

---
 docs/backend/OPENVINO.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/backend/OPENVINO.md b/docs/backend/OPENVINO.md
index 7c2e733b03c..3395b70e60b 100644
--- a/docs/backend/OPENVINO.md
+++ b/docs/backend/OPENVINO.md
@@ -52,7 +52,7 @@ Accuracy and performance optimizations for quantized models are still work in pr
 
 - **Primary supported quantization scheme is `Q4_0`**
 - `Q4_0` and `Q4_1` tensors are requantized to int4 gs128 symmetric
-- `Q6_K` tensors are requentized to int8 except for the token embedding matrix
+- `Q6_K` tensors are requentized to int8 except for the token embedding matrix which is dequantized to fp16
 
 #### Additional Notes

From 086878d28d16cbbe594fb4002f74a177e44f06e2 Mon Sep 17 00:00:00 2001
From: Yamini Nimmagadda
Date: Mon, 12 Jan 2026 17:43:28 -0800
Subject: [PATCH 5/6] Update build.md

---
 docs/build.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/build.md b/docs/build.md
index f7e793c155a..e3a467c37f8 100644
--- a/docs/build.md
+++ b/docs/build.md
@@ -597,7 +597,7 @@ To read documentation for how to build on IBM Z & LinuxONE, [click here](./build
 
 [OpenVINO](https://docs.openvino.ai/2025/index.html) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge. The OpenVINO backend enhances performance by leveraging hardware-specific optimizations and can be enabled for use with llama.cpp.
 
-Follow the instructions below to install OpenVINO runtime and build llama.cpp with OpenVINO support.
+Follow the instructions below to install OpenVINO runtime and build llama.cpp with OpenVINO support. For more detailed information on the OpenVINO backend, refer to [OPENVINO.md](backend/OPENVINO.md).

From bc5ff50fc478915939fd32ab154b6016b8ef3387 Mon Sep 17 00:00:00 2001
From: Yamini Nimmagadda
Date: Tue, 13 Jan 2026 14:33:16 -0800
Subject: [PATCH 6/6] Update OPENVINO.md

---
 docs/backend/OPENVINO.md | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/docs/backend/OPENVINO.md b/docs/backend/OPENVINO.md
index 3395b70e60b..d69aaedf613 100644
--- a/docs/backend/OPENVINO.md
+++ b/docs/backend/OPENVINO.md
@@ -36,23 +36,15 @@ Accuracy and performance optimizations for quantized models are still work in pr
 
 ## Quantization Support Details
 
-### CPU
+### CPU and GPU
 
 - **`Q4_0`, `Q4_1`, `Q4_K_M`, `Q6_K` models are supported**
-- `Q6_K` tensors (6-bit, gs16 symmetric) are converted to int8 gs16 symmetric
-- `Q5_K` tensors (5-bit, gs32 asymmetric) are converted to int8 gs32 asymmetric
-
-### GPU
-
-- **`Q4_0`, `Q4_1`, `Q4_K_M`, `Q6_K` models are supported**
-- `Q6_K` tensors (6-bit, gs16 symmetric) are requantized to int8 gs32 symmetric
-- `Q5_K` tensors (5-bit, gs32 asymmetric) are converted to int8 gs32 asymmetric
+- `Q5_K` and `Q6_K` tensors are converted to `Q8_0_C`
 
 ### NPU
 
 - **Primary supported quantization scheme is `Q4_0`**
-- `Q4_0` and `Q4_1` tensors are requantized to int4 gs128 symmetric
-- `Q6_K` tensors are requentized to int8 except for the token embedding matrix which is dequantized to fp16
+- `Q6_K` tensors are requantized to `Q4_0_128` in general. For embedding weights, `Q6_K` tensors are requantized to `Q8_0_C`, except for the token embedding matrix, which is dequantized to FP16.
 
 #### Additional Notes