Refractor #2456

Draft
Qubitium wants to merge 60 commits into main from refractor-simple-quant

Conversation

@Qubitium (Collaborator) commented Mar 9, 2026

No description provided.

@Qubitium (Collaborator, Author) commented Mar 10, 2026

Fused GGUF kernels vs baseline, Zen 3 host:

  +--------+------------+-------------+----------+---------+
  | device | case       | baseline_ms | fused_ms | speedup |
  +--------+------------+-------------+----------+---------+
  | cpu    | attn q4_k  | 27.397      | 18.865   | 1.45x   |
  | cpu    | attn q5_k  | 28.618      | 22.510   | 1.27x   |
  | cpu    | attn q6_k  | 26.458      | 28.665   | 0.92x   |
  | cpu    | mlp q4_k   | 83.428      | 45.598   | 1.83x   |
  | cpu    | mlp q5_k   | 101.261     | 50.265   | 2.01x   |
  | cpu    | mlp q6_k   | 84.662      | 51.076   | 1.66x   |
  | cuda   | attn q4_k  | 0.778       | 0.652    | 1.19x   |
  | cuda   | attn q5_k  | 0.612       | 0.625    | 0.98x   |
  | cuda   | attn q6_k  | 0.433       | 0.440    | 0.99x   |
  | cuda   | mlp q4_k   | 0.793       | 0.596    | 1.33x   |
  | cuda   | mlp q5_k   | 0.943       | 0.780    | 1.21x   |
  | cuda   | mlp q6_k   | 0.720       | 0.535    | 1.35x   |
  +--------+------------+-------------+----------+---------+
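For reference, the speedup column above is just baseline_ms / fused_ms. A quick standalone check against the first CPU row (helper is illustrative, not code from this PR):

```python
def speedup(baseline_ms: float, fused_ms: float) -> str:
    """Speedup of the fused kernel over the baseline, formatted like the table."""
    return f"{baseline_ms / fused_ms:.2f}x"

print(speedup(27.397, 18.865))  # cpu attn q4_k -> 1.45x
```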

  Autotuned dispatch, shipped defaults, post-warmup steady state:

  +--------+------------+-------+-----------+-------------+---------+
  | device | case       | plan  | static_ms | autotune_ms | speedup |
  +--------+------------+-------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused | 24.534    | 22.205      | 1.10x   |
  | cpu    | attn q5_k  | fused | 24.922    | 23.628      | 1.05x   |
  | cpu    | attn q6_k  | fused | 16.560    | 14.065      | 1.18x   |
  | cpu    | mlp q4_k   | fused | 48.167    | 44.739      | 1.08x   |
  | cpu    | mlp q5_k   | fused | 58.313    | 53.621      | 1.09x   |
  | cpu    | mlp q6_k   | fused | 53.546    | 49.650      | 1.08x   |
  | cuda   | attn q4_k  | none  | 0.543     | 0.530       | 1.02x   |
  | cuda   | attn q5_k  | none  | 0.649     | 0.647       | 1.00x   |
  | cuda   | attn q6_k  | none  | 0.507     | 0.612       | 0.83x   |
  | cuda   | mlp q4_k   | fused | 0.589     | 0.593       | 0.99x   |
  | cuda   | mlp q5_k   | fused | 0.692     | 0.702       | 0.99x   |
  | cuda   | mlp q6_k   | fused | 0.525     | 0.521       | 1.01x   |
  +--------+------------+-------+-----------+-------------+---------+
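The plan column suggests a per-(device, case) choice between candidates made by timing each after warmup. A minimal sketch of that kind of autotuner (names and structure are illustrative, not this PR's API):

```python
import time
from typing import Callable, Dict

def autotune(candidates: Dict[str, Callable[[], None]],
             warmup: int = 3, iters: int = 10) -> str:
    """Time each candidate after a warmup phase and return the fastest one's name."""
    best_name, best_ms = "", float("inf")
    for name, fn in candidates.items():
        for _ in range(warmup):
            fn()  # discard warmup iterations (compile, cache fill, etc.)
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        avg_ms = (time.perf_counter() - start) * 1000.0 / iters
        if avg_ms < best_ms:
            best_name, best_ms = name, avg_ms
    return best_name
```

The selected plan would then be cached per (device, case) so steady-state dispatch pays no tuning cost, matching the "post-warmup steady state" numbers above.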

  Autotuned dispatch with --force-candidate, to answer the earlier attention question directly:

  +--------+------------+---------------+-----------+-------------+---------+
  | device | case       | autotune plan | static_ms | autotune_ms | speedup |
  +--------+------------+---------------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused         | 20.209    | 18.833      | 1.07x   |
  | cpu    | attn q5_k  | fused         | 26.965    | 25.374      | 1.06x   |
  | cpu    | attn q6_k  | fused         | 20.340    | 18.419      | 1.10x   |
  | cpu    | mlp q4_k   | fused         | 54.147    | 49.410      | 1.10x   |
  | cpu    | mlp q5_k   | fused         | 58.414    | 50.907      | 1.15x   |
  | cpu    | mlp q6_k   | fused         | 46.514    | 48.151      | 0.97x   |
  | cuda   | attn q4_k  | dense         | 0.543     | 0.540       | 1.00x   |
  | cuda   | attn q5_k  | dense         | 0.656     | 0.625       | 1.05x   |
  | cuda   | attn q6_k  | dense         | 0.459     | 0.467       | 0.98x   |
  | cuda   | mlp q4_k   | fused         | 0.699     | 0.645       | 1.08x   |
  | cuda   | mlp q5_k   | fused         | 0.677     | 0.672       | 1.01x   |
  | cuda   | mlp q6_k   | fused         | 0.570     | 0.701       | 0.81x   |
  +--------+------------+---------------+-----------+-------------+---------+
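The flag name --force-candidate comes from the comment above, but the wiring below is a hypothetical sketch of how such an override could bypass the tuned plan, not this PR's actual parsing or dispatch code:

```python
import argparse
from typing import Optional

# Hypothetical CLI wiring: when --force-candidate is set, the dispatcher
# skips autotuning and always uses the named plan.
parser = argparse.ArgumentParser()
parser.add_argument("--force-candidate",
                    choices=["none", "dense", "fused"], default=None)

def choose_plan(tuned_plan: str, forced: Optional[str]) -> str:
    """A forced candidate takes precedence over whatever the autotuner picked."""
    return forced if forced is not None else tuned_plan

args = parser.parse_args(["--force-candidate", "dense"])
print(choose_plan("fused", args.force_candidate))  # forced -> dense
```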

@Qubitium (Collaborator, Author) commented Mar 10, 2026

The GGUF Triton kernel is now faster on average than llama.cpp's CUDA kernels at the module-inference level on a 4090:

  +----------------------+--------+--------+
  | metric               | before | after  |
  +----------------------+--------+--------+
  | triton wins          | 308    | 358    |
  | cpp wins             | 112    | 62     |
  | torch wins           | 0      | 0      |
  | triton median speed  | 1.000x | 1.039x |
  | triton mean speed    | 1.000x | 1.111x |
  +----------------------+--------+--------+
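Win counts and median/mean relative speed like the table above can be derived from per-case timings. A sketch of that aggregation (illustrative, not the PR's benchmark harness):

```python
from statistics import mean, median

def summarize(cases):
    """cases: one dict per benchmarked module call, mapping backend -> latency in ms."""
    wins = {"triton": 0, "cpp": 0, "torch": 0}
    ratios = []
    for timings in cases:
        wins[min(timings, key=timings.get)] += 1
        # Triton's speed relative to the fastest other backend (>1.0 = Triton wins).
        best_other = min(v for k, v in timings.items() if k != "triton")
        ratios.append(best_other / timings["triton"])
    return wins, median(ratios), mean(ratios)

wins, med, avg = summarize([
    {"triton": 0.50, "cpp": 0.60, "torch": 1.2},
    {"triton": 0.70, "cpp": 0.65, "torch": 1.1},
])
```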

@Qubitium changed the title from "Major Refractor: v6.0 roadmap" to "Refractor" on Mar 11, 2026
