Skip to content

Intel Arc Pro B60 - Host Freeze with VFIO Passthrough and ReBAR Enabled #880

@Qmo37

Description

@Qmo37

Intel Arc Pro B60 - VFIO Passthrough Issue and Small BAR Compute Question

Description

I am experiencing host system freezes when using Intel Arc Pro B60 24GB with VFIO/KVM passthrough on Proxmox VE. I've done some investigation and would like to ask if there are any known workarounds or if my assumptions about the cause are correct.


Environment

Proxmox Host

Component Details
GPU Intel Arc Pro B60 24GB (PCI ID: 8086:e211)
GPU Firmware BMG_21.1174
GOP (vBIOS) 23.0.1061
CPU Intel Core Ultra 7 265K
Motherboard ASUS PRIME Z890-P WIFI-CSM (BIOS 2401)
RAM 96GB DDR5
Host OS Proxmox VE 9.x
Host Kernel 6.17.4-1-pve

Guest VM

Component Details
OS Ubuntu 24.04.3 LTS
Kernel 6.14.0-1008-intel
Driver xe

What I Observed

With ReBAR Enabled

At VM startup, I see this warning:

QEMU: kvm: warning: vfio_container_dma_map(0x380000000000, 0x800000000, ...) = -22 (Invalid argument)
0000:04:00.0: PCI peer-to-peer transactions on BARs are not supported

BAR configuration (from Proxmox host):

Region 2: Memory at a000000000 (64-bit, prefetchable) [size=32G]
BAR 2: current size: 32GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB

IOMMU configuration (from Proxmox host):

DMAR: Host address width 42

After 6-24 hours, the host freezes. I observed three different freeze behaviors:

  1. Full Host Freeze: Entire Proxmox becomes unresponsive, requires hard reset
  2. Partial Host Freeze: VMs continue running, web UI partially works, but shell/SSH becomes inaccessible
  3. VM Freeze: VM shows "running (internal-error)", host stays responsive

Kernel soft lockup (captured from console before host became unresponsive):

[33275.695601] watchdog: BUG: soft lockup - CPU#14 stuck for 5212s! [ps:360867]
[33255.695867]  ? x64_sys_call+0x1151/0x2330
[33255.695868]  ? do_syscall_64+0xb8/0xa30
[33255.695869]  ? x64_sys_call+0x2093/0x2330
[33255.695869]  ? do_syscall_64+0xb8/0xa30
[33255.695870]  ? count_memcg_events+0xd7/0x1a0
[33255.695872]  ? handle_mm_fault+0x254/0x370
[33255.695874]  ? do_user_addr_fault+0x2f8/0x830
[33255.695875]  ? irqentry_exit_to_user_mode+0x2e/0x290
[33255.695877]  ? irqentry_exit+0x43/0x50
[33255.695878]  ? exc_page_fault+0x90/0x1b0
[33255.695879]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The CPU was stuck for ~87 minutes (5212 seconds) before the watchdog reported it.

With ReBAR Manually Disabled

To test, I manually limited the BAR to 256MB via VFIO-PCI rebind on Proxmox host (and disabled ReBAR option in BIOS):

echo "0000:04:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:04:00.0" > /sys/bus/pci/drivers/vfio-pci/bind

lspci -vvv -s 04:00.0 | grep "Region 2"
# Region 2: Memory at a000000000 (64-bit, prefetchable) [size=256M]

qm start 101

dmesg | tail -20 | grep -i "vfio\|dma\|p2p"
# [611.243671] vfio-pci 0000:04:00.0: resetting
# [611.350629] vfio-pci 0000:04:00.0: reset done
# [611.375782] vfio-pci 0000:05:00.0: enabling

Result: The host no longer freezes.

However, inside the VM I see:

$ lspci -vvs 01:00.0 | grep Region
    Region 0: Memory at 80000000 (64-bit, non-prefetchable) [size=16M]
    Region 2: Memory at 380000000000 (64-bit, prefetchable) [size=256M]

$ python3 -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"

WARNING: Small BAR detected for device 0000:01:00.0
XPU device count is zero!
False 0

GPU compute is disabled with small BAR configuration.


What I Already Tried

Fix Attempted Result
Update GPU firmware in a bare-metal Windows machine (Windows driver 101.8306) Freeze still occurred
Update BIOS (2207 → 2401) Freeze still occurred (escalated from VM freeze to full host freeze)
Disable ASPM (via setpci) Freeze still occurred
Enable GPU reset methods ("bus") Immediate host crash when restarting VM
Kernel parameters (pcie_aspm=off, etc.) Freeze still occurred

My Assumptions (I Could Be Wrong)

Based on what I observed:

  1. The DMA mapping failure (-22 EINVAL) at address 0x380000000000 with size 0x800000000 might be related to the host's 42-bit IOMMU address width?

  2. The host freeze might occur when something tries to access memory regions that weren't properly mapped due to the DMA mapping failure?

  3. The kernel soft lockup call stack (exc_page_faulthandle_mm_fault) suggests a page fault handler is waiting for a PCIe/DMA transaction that never completes?

  4. The "Small BAR detected" warning in the VM causes the compute runtime to disable XPU functionality?


Questions

  1. Is there a known workaround for using Arc Pro B60 with VFIO passthrough and ReBAR?

  2. Are there any kernel parameters, QEMU options, or driver settings I might have missed?

  3. Is there a way to enable compute functionality with small BAR configuration? I understand there might be performance tradeoffs.

  4. Are my assumptions about the cause reasonable, or should I be looking at something else?

  5. Should I be filing this somewhere else (xe driver GitLab, QEMU, etc.)?


Environment Summary

Host:
  Proxmox VE 9.x, Kernel 6.17.4-1-pve
  ASUS PRIME Z890-P WIFI-CSM (BIOS 2401)
  Intel Core Ultra 7 265K, 96GB DDR5
  IOMMU: Host address width 42

GPU:
  Intel Arc Pro B60 24GB (8086:e211)
  Firmware: BMG_21.1174
  GOP: 23.0.1061

Guest VM:
  Ubuntu 24.04.3 LTS, Kernel 6.14.0-1008-intel
  Driver: xe
  PyTorch: 2.7.0+xpu

Report date: 2026-01-07
This report was prepared with assistance from Claude Code.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions