Refractor #2456

Draft
Qubitium wants to merge 60 commits into main from refractor-simple-quant

Conversation

@Qubitium (Collaborator) commented Mar 9, 2026

No description provided.

@Qubitium (Collaborator, Author) commented Mar 10, 2026

Fused GGUF kernels vs baseline, Zen 3 host:

  +--------+------------+-------------+----------+---------+
  | device | case       | baseline_ms | fused_ms | speedup |
  +--------+------------+-------------+----------+---------+
  | cpu    | attn q4_k  | 27.397      | 18.865   | 1.45x   |
  | cpu    | attn q5_k  | 28.618      | 22.510   | 1.27x   |
  | cpu    | attn q6_k  | 26.458      | 28.665   | 0.92x   |
  | cpu    | mlp q4_k   | 83.428      | 45.598   | 1.83x   |
  | cpu    | mlp q5_k   | 101.261     | 50.265   | 2.01x   |
  | cpu    | mlp q6_k   | 84.662      | 51.076   | 1.66x   |
  | cuda   | attn q4_k  | 0.778       | 0.652    | 1.19x   |
  | cuda   | attn q5_k  | 0.612       | 0.625    | 0.98x   |
  | cuda   | attn q6_k  | 0.433       | 0.440    | 0.99x   |
  | cuda   | mlp q4_k   | 0.793       | 0.596    | 1.33x   |
  | cuda   | mlp q5_k   | 0.943       | 0.780    | 1.21x   |
  | cuda   | mlp q6_k   | 0.720       | 0.535    | 1.35x   |
  +--------+------------+-------------+----------+---------+
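For reference, the speedup column above is just baseline_ms / fused_ms. A quick standalone check against the first CPU row (helper is illustrative, not code from this PR):

```python
def speedup(baseline_ms: float, fused_ms: float) -> str:
    """Speedup of the fused kernel over the baseline, formatted like the table."""
    return f"{baseline_ms / fused_ms:.2f}x"

print(speedup(27.397, 18.865))  # cpu attn q4_k -> 1.45x
```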

  Autotuned dispatch, shipped defaults, post-warmup steady state:

  +--------+------------+-------+-----------+-------------+---------+
  | device | case       | plan  | static_ms | autotune_ms | speedup |
  +--------+------------+-------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused | 24.534    | 22.205      | 1.10x   |
  | cpu    | attn q5_k  | fused | 24.922    | 23.628      | 1.05x   |
  | cpu    | attn q6_k  | fused | 16.560    | 14.065      | 1.18x   |
  | cpu    | mlp q4_k   | fused | 48.167    | 44.739      | 1.08x   |
  | cpu    | mlp q5_k   | fused | 58.313    | 53.621      | 1.09x   |
  | cpu    | mlp q6_k   | fused | 53.546    | 49.650      | 1.08x   |
  | cuda   | attn q4_k  | none  | 0.543     | 0.530       | 1.02x   |
  | cuda   | attn q5_k  | none  | 0.649     | 0.647       | 1.00x   |
  | cuda   | attn q6_k  | none  | 0.507     | 0.612       | 0.83x   |
  | cuda   | mlp q4_k   | fused | 0.589     | 0.593       | 0.99x   |
  | cuda   | mlp q5_k   | fused | 0.692     | 0.702       | 0.99x   |
  | cuda   | mlp q6_k   | fused | 0.525     | 0.521       | 1.01x   |
  +--------+------------+-------+-----------+-------------+---------+
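The plan column suggests a per-(device, case) choice between candidates made by timing each after warmup. A minimal sketch of that kind of autotuner (names and structure are illustrative, not this PR's API):

```python
import time
from typing import Callable, Dict

def autotune(candidates: Dict[str, Callable[[], None]],
             warmup: int = 3, iters: int = 10) -> str:
    """Time each candidate after a warmup phase and return the fastest one's name."""
    best_name, best_ms = "", float("inf")
    for name, fn in candidates.items():
        for _ in range(warmup):
            fn()  # discard warmup iterations (compile, cache fill, etc.)
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        avg_ms = (time.perf_counter() - start) * 1000.0 / iters
        if avg_ms < best_ms:
            best_name, best_ms = name, avg_ms
    return best_name
```

The selected plan would then be cached per (device, case) so steady-state dispatch pays no tuning cost, matching the "post-warmup steady state" numbers above.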

  Autotuned dispatch with --force-candidate, to answer the earlier attention question directly:

  +--------+------------+---------------+-----------+-------------+---------+
  | device | case       | autotune plan | static_ms | autotune_ms | speedup |
  +--------+------------+---------------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused         | 20.209    | 18.833      | 1.07x   |
  | cpu    | attn q5_k  | fused         | 26.965    | 25.374      | 1.06x   |
  | cpu    | attn q6_k  | fused         | 20.340    | 18.419      | 1.10x   |
  | cpu    | mlp q4_k   | fused         | 54.147    | 49.410      | 1.10x   |
  | cpu    | mlp q5_k   | fused         | 58.414    | 50.907      | 1.15x   |
  | cpu    | mlp q6_k   | fused         | 46.514    | 48.151      | 0.97x   |
  | cuda   | attn q4_k  | dense         | 0.543     | 0.540       | 1.00x   |
  | cuda   | attn q5_k  | dense         | 0.656     | 0.625       | 1.05x   |
  | cuda   | attn q6_k  | dense         | 0.459     | 0.467       | 0.98x   |
  | cuda   | mlp q4_k   | fused         | 0.699     | 0.645       | 1.08x   |
  | cuda   | mlp q5_k   | fused         | 0.677     | 0.672       | 1.01x   |
  | cuda   | mlp q6_k   | fused         | 0.570     | 0.701       | 0.81x   |
  +--------+------------+---------------+-----------+-------------+---------+
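The flag name --force-candidate comes from the comment above, but the wiring below is a hypothetical sketch of how such an override could bypass the tuned plan, not this PR's actual parsing or dispatch code:

```python
import argparse
from typing import Optional

# Hypothetical CLI wiring: when --force-candidate is set, the dispatcher
# skips autotuning and always uses the named plan.
parser = argparse.ArgumentParser()
parser.add_argument("--force-candidate",
                    choices=["none", "dense", "fused"], default=None)

def choose_plan(tuned_plan: str, forced: Optional[str]) -> str:
    """A forced candidate takes precedence over whatever the autotuner picked."""
    return forced if forced is not None else tuned_plan

args = parser.parse_args(["--force-candidate", "dense"])
print(choose_plan("fused", args.force_candidate))  # forced -> dense
```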

@Qubitium (Collaborator, Author) commented Mar 10, 2026

The GGUF Triton kernel is now faster on average than llama.cpp's CUDA kernels at the module-inference level on a 4090:

  +----------------------+--------+--------+
  | metric               | before | after  |
  +----------------------+--------+--------+
  | triton wins          | 308    | 358    |
  | cpp wins             | 112    | 62     |
  | torch wins           | 0      | 0      |
  | triton median speed  | 1.000x | 1.039x |
  | triton mean speed    | 1.000x | 1.111x |
  +----------------------+--------+--------+
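Win counts and median/mean relative speed like the table above can be derived from per-case timings. A sketch of that aggregation (illustrative, not the PR's benchmark harness):

```python
from statistics import mean, median

def summarize(cases):
    """cases: one dict per benchmarked module call, mapping backend -> latency in ms."""
    wins = {"triton": 0, "cpp": 0, "torch": 0}
    ratios = []
    for timings in cases:
        wins[min(timings, key=timings.get)] += 1
        # Triton's speed relative to the fastest other backend (>1.0 = Triton wins).
        best_other = min(v for k, v in timings.items() if k != "triton")
        ratios.append(best_other / timings["triton"])
    return wins, median(ratios), mean(ratios)

wins, med, avg = summarize([
    {"triton": 0.50, "cpp": 0.60, "torch": 1.2},
    {"triton": 0.70, "cpp": 0.65, "torch": 1.1},
])
```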

@Qubitium changed the title from "Major Refractor: v6.0 roadmap" to "Refractor" on Mar 11, 2026
