Intel Arc Pro B60 - VFIO Passthrough Issue and Small BAR Compute Question
Description
I am experiencing host system freezes when using an Intel Arc Pro B60 24GB with VFIO/KVM passthrough on Proxmox VE. I have done some investigation and would like to know whether there are any known workarounds and whether my assumptions about the cause are correct.
Environment
Proxmox Host
| Component | Details |
|---|---|
| GPU | Intel Arc Pro B60 24GB (PCI ID: 8086:e211) |
| GPU Firmware | BMG_21.1174 |
| GOP (vBIOS) | 23.0.1061 |
| CPU | Intel Core Ultra 7 265K |
| Motherboard | ASUS PRIME Z890-P WIFI-CSM (BIOS 2401) |
| RAM | 96GB DDR5 |
| Host OS | Proxmox VE 9.x |
| Host Kernel | 6.17.4-1-pve |
Guest VM
| Component | Details |
|---|---|
| OS | Ubuntu 24.04.3 LTS |
| Kernel | 6.14.0-1008-intel |
| Driver | xe |
What I Observed
With ReBAR Enabled
At VM startup, I see this warning:
QEMU: kvm: warning: vfio_container_dma_map(0x380000000000, 0x800000000, ...) = -22 (Invalid argument)
0000:04:00.0: PCI peer-to-peer transactions on BARs are not supported
BAR configuration (from Proxmox host):
Region 2: Memory at a000000000 (64-bit, prefetchable) [size=32G]
BAR 2: current size: 32GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
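For reference, the supported sizes listed above are all powers of two from 256MB to 32GB. The Linux PCI sysfs attribute `resourceN_resize` describes resizable BAR sizes by a bit index `n`, where the size is 1 MB shifted left by `n`. The sketch below only does that arithmetic for the sizes reported here. The helper names are illustrative, and nothing in it reads or writes sysfs.

```python
# Illustrative helpers (names are not from any library): convert between
# resizable-BAR sizes and the bit index n, where size = 1 MiB << n.

MIB = 1 << 20


def rebar_index(size_bytes: int) -> int:
    """Return n such that size_bytes == 1 MiB << n."""
    if size_bytes < MIB or size_bytes & (size_bytes - 1):
        raise ValueError("size must be a power of two of at least 1 MiB")
    return (size_bytes // MIB).bit_length() - 1


def supported_mask(sizes_bytes) -> int:
    """Build a bitmask with bit n set for every supported size."""
    mask = 0
    for size in sizes_bytes:
        mask |= 1 << rebar_index(size)
    return mask


# The eight sizes reported for BAR 2: 256MB, 512MB, 1GB, ..., 32GB.
SUPPORTED = [(256 * MIB) << i for i in range(8)]

if __name__ == "__main__":
    for size in SUPPORTED:
        print(f"{size // MIB:>6} MiB -> index {rebar_index(size)}")
    print(f"mask = {supported_mask(SUPPORTED):#x}")
```

Under this encoding, 256MB corresponds to index 8 and 32GB to index 15.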
IOMMU configuration (from Proxmox host):
DMAR: Host address width 42
After 6-24 hours, the host freezes. I observed three different freeze behaviors:
- Full Host Freeze: Entire Proxmox becomes unresponsive, requires hard reset
- Partial Host Freeze: VMs continue running, web UI partially works, but shell/SSH becomes inaccessible
- VM Freeze: VM shows "running (internal-error)", host stays responsive
Kernel soft lockup (captured from console before host became unresponsive):
[33275.695601] watchdog: BUG: soft lockup - CPU#14 stuck for 5212s! [ps:360867]
[33255.695867] ? x64_sys_call+0x1151/0x2330
[33255.695868] ? do_syscall_64+0xb8/0xa30
[33255.695869] ? x64_sys_call+0x2093/0x2330
[33255.695869] ? do_syscall_64+0xb8/0xa30
[33255.695870] ? count_memcg_events+0xd7/0x1a0
[33255.695872] ? handle_mm_fault+0x254/0x370
[33255.695874] ? do_user_addr_fault+0x2f8/0x830
[33255.695875] ? irqentry_exit_to_user_mode+0x2e/0x290
[33255.695877] ? irqentry_exit+0x43/0x50
[33255.695878] ? exc_page_fault+0x90/0x1b0
[33255.695879] entry_SYSCALL_64_after_hwframe+0x76/0x7e
When this message was logged, the CPU had been stuck for ~87 minutes (5212 seconds).
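The fields of the watchdog line can be extracted mechanically, which may help when comparing several freezes. This is a minimal sketch: the regular expression is written against the single line captured above and is an assumption about the message layout, not a general kernel log parser.

```python
import re

# Pattern written against the one soft-lockup line captured in this report.
SOFT_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line: str):
    """Return CPU, stuck time, task name and PID, or None if no match."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match["cpu"]),
        "seconds": int(match["seconds"]),
        "comm": match["comm"],
        "pid": int(match["pid"]),
    }


LINE = (
    "[33275.695601] watchdog: BUG: soft lockup - "
    "CPU#14 stuck for 5212s! [ps:360867]"
)

if __name__ == "__main__":
    info = parse_soft_lockup(LINE)
    print(info)
    print(f"stuck for about {info['seconds'] / 60:.0f} minutes")
```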
With ReBAR Manually Disabled
To test, I manually limited the BAR to 256MB via a VFIO-PCI rebind on the Proxmox host (and disabled the ReBAR option in the BIOS):
echo "0000:04:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:04:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
lspci -vvv -s 04:00.0 | grep "Region 2"
# Region 2: Memory at a000000000 (64-bit, prefetchable) [size=256M]
qm start 101
dmesg | tail -20 | grep -i "vfio\|dma\|p2p"
# [611.243671] vfio-pci 0000:04:00.0: resetting
# [611.350629] vfio-pci 0000:04:00.0: reset done
# [611.375782] vfio-pci 0000:05:00.0: enabling
Result: The host no longer freezes.
However, inside the VM I see:
$ lspci -vvs 01:00.0 | grep Region
Region 0: Memory at 80000000 (64-bit, non-prefetchable) [size=16M]
Region 2: Memory at 380000000000 (64-bit, prefetchable) [size=256M]
$ python3 -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"
WARNING: Small BAR detected for device 0000:01:00.0
XPU device count is zero!
False 0
With the 256MB BAR, GPU compute is unavailable in the guest: PyTorch reports no XPU devices.
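The guest `lspci` output can be checked against the card's memory size with a short script. This is a sketch under one loud assumption: that "small BAR" means BAR 2 is smaller than the 24GB of VRAM. The report does not confirm the exact condition the compute runtime tests. The function names are illustrative, and the parser only handles `[size=...]` values that carry a K/M/G/T suffix.

```python
import re

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
REGION_RE = re.compile(
    r"Region (\d+): Memory at ([0-9a-f]+) .*\[size=(\d+)([KMGT])\]"
)


def parse_regions(lspci_text: str) -> dict:
    """Map BAR index -> (base address, size in bytes) from lspci output."""
    regions = {}
    for match in REGION_RE.finditer(lspci_text):
        index, base, number, unit = match.groups()
        regions[int(index)] = (int(base, 16), int(number) * UNITS[unit])
    return regions


def is_small_bar(regions: dict, vram_bytes: int, bar: int = 2) -> bool:
    """ASSUMPTION: 'small BAR' means the BAR is smaller than VRAM."""
    _, size = regions[bar]
    return size < vram_bytes


# Guest output copied from this report.
GUEST_LSPCI = """\
Region 0: Memory at 80000000 (64-bit, non-prefetchable) [size=16M]
Region 2: Memory at 380000000000 (64-bit, prefetchable) [size=256M]
"""
VRAM_BYTES = 24 << 30  # 24GB card

if __name__ == "__main__":
    regions = parse_regions(GUEST_LSPCI)
    print(regions)
    print("small BAR:", is_small_bar(regions, VRAM_BYTES))
```

By the same test, the 32GB BAR seen on the host with ReBAR enabled would not count as small.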
What I Already Tried
| Fix Attempted | Result |
|---|---|
| Update GPU firmware in a bare-metal Windows machine (Windows driver 101.8306) | Freeze still occurred |
| Update BIOS (2207 → 2401) | Freeze still occurred (escalated from VM freeze to full host freeze) |
| Disable ASPM (via setpci) | Freeze still occurred |
| Enable GPU reset methods ("bus") | Immediate host crash when restarting VM |
| Kernel parameters (pcie_aspm=off, etc.) | Freeze still occurred |
My Assumptions (I Could Be Wrong)
Based on what I observed:
- The DMA mapping failure (-22 EINVAL) at address 0x380000000000 with size 0x800000000 might be related to the host's 42-bit IOMMU address width?
- The host freeze might occur when something tries to access memory regions that weren't properly mapped due to the DMA mapping failure?
- The kernel soft lockup call stack (exc_page_fault → handle_mm_fault) suggests a page fault handler is waiting for a PCIe/DMA transaction that never completes?
- The "Small BAR detected" warning in the VM causes the compute runtime to disable XPU functionality?
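The arithmetic behind the first assumption can be checked directly. The sketch below computes how many address bits the failed mapping would need and compares that with the reported host address width. Whether that width is actually what makes `vfio_container_dma_map` return -22 remains an assumption. The script only shows that the numbers are consistent with it.

```python
def bits_needed(base: int, size: int) -> int:
    """Number of address bits required to reach the last byte of a range."""
    return (base + size - 1).bit_length()


# Values taken from the QEMU warning and the DMAR line in this report.
DMA_BASE = 0x380000000000      # address passed to vfio_container_dma_map
DMA_SIZE = 0x800000000         # 32 GiB, matching the 32GB BAR 2
HOST_ADDRESS_WIDTH = 42        # "DMAR: Host address width 42"

needed = bits_needed(DMA_BASE, DMA_SIZE)
fits = needed <= HOST_ADDRESS_WIDTH

if __name__ == "__main__":
    print(f"mapping base : {DMA_BASE / 2**40:.0f} TiB")
    print(f"mapping size : {DMA_SIZE / 2**30:.0f} GiB")
    print(f"limit        : {2**HOST_ADDRESS_WIDTH / 2**40:.0f} TiB")
    print(f"bits needed  : {needed}, fits in {HOST_ADDRESS_WIDTH} bits: {fits}")
```

The mapping starts at 56 TiB and needs 46 address bits, while a 42-bit width covers only 4 TiB.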
Questions
- Is there a known workaround for using Arc Pro B60 with VFIO passthrough and ReBAR?
- Are there any kernel parameters, QEMU options, or driver settings I might have missed?
- Is there a way to enable compute functionality with small BAR configuration? I understand there might be performance tradeoffs.
- Are my assumptions about the cause reasonable, or should I be looking at something else?
- Should I be filing this somewhere else (xe driver GitLab, QEMU, etc.)?
Environment Summary
Host:
Proxmox VE 9.x, Kernel 6.17.4-1-pve
ASUS PRIME Z890-P WIFI-CSM (BIOS 2401)
Intel Core Ultra 7 265K, 96GB DDR5
IOMMU: Host address width 42
GPU:
Intel Arc Pro B60 24GB (8086:e211)
Firmware: BMG_21.1174
GOP: 23.0.1061
Guest VM:
Ubuntu 24.04.3 LTS, Kernel 6.14.0-1008-intel
Driver: xe
PyTorch: 2.7.0+xpu
Report date: 2026-01-07
This report was prepared with assistance from Claude Code.