Stepcompress cache#7191

Draft
nefelim4ag wants to merge 2 commits into Klipper3d:master from nefelim4ag:stepcompress-cache
Conversation

@nefelim4ag
Collaborator

@nefelim4ag nefelim4ag commented Feb 7, 2026

This is a derivative of my work around the step compression code.
I hope those changes make sense on their own.

The first change moves where we compute max_error; it produces exactly the same result as the current master code.
It is a "bit" faster, but I would say within the measurement error of ~0.9%.

I think this is because the current compressor works close to O(n) in practice, or stepcompress takes too little time relative to the batch computations.

The second change exploits that knowledge and simply removes the occasional increase of max_error.
This simplifies the logic - we no longer compute it during compression - and decreases speed jumps a bit.

The first commit makes any future changes slightly simpler, and makes more complicated repetitive computations cheaper.

If we worry about increased memory usage, it is possible to simply halve the maximum buffer size.
Due to itersolve quantization, in practice most "count" values will fit within the reduced maximum.
We could probably limit it to a specific value: https://www.klipper3d.org/Protocol.html#variable-length-quantities
Within the compressor: 65535 -> 12287.
Total step count in memory: halved from 65535 to 32768, to keep memory usage as before.
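The 12287 cutoff follows from the VLQ ranges in the linked protocol doc: one byte covers -32..95, and each additional byte widens the range by 7 bits. A small sketch of that arithmetic (the helper name `vlq_len` is mine, not from Klipper):

```python
def vlq_len(v):
    # Bytes needed for integer v under Klipper's VLQ scheme, per the
    # documented ranges: 1 byte covers -32..95, and every additional
    # byte widens the range by 7 bits.
    lo, hi, n = -32, 95, 1
    while not (lo <= v <= hi):
        lo <<= 7
        hi = (hi << 7) | 0x7f
        n += 1
    return n

print(vlq_len(12287))  # 2 -- largest count that still fits in two bytes
print(vlq_len(12288))  # 3
print(vlq_len(65535))  # 3
```

So capping "count" at 12287 keeps every queue_step count within a two-byte VLQ.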

Some step frequency data

So, if one decodes the serial data, the peak values for count per cmd are:

$ ./klippy/parsedump.py stepcompress/150mhz.dict stepcompress/150Mhz/master.serial > stepcompress/150Mhz/master.log
...
$ grep queue_step stepcompress/012Mhz/master.log | grep count= | cut -d' ' -f4 | sed 's/=/ /g' | sort -h -k2 -r | head -n 20
count 6723
count 6391
count 6303
$ grep queue_step stepcompress/064Mhz/master.log | grep count= | cut -d' ' -f4 | sed 's/=/ /g' | sort -h -k2 -r | head -n 20
count 10002
count 9205
count 9170
$ grep queue_step stepcompress/150Mhz/master.log | grep count= | cut -d' ' -f4 | sed 's/=/ /g' | sort -h -k2 -r | head -n 20
count 9184
count 9150
count 9044

How to extract graph data:

grep queue_step stepcompress/150Mhz/master.log | grep count= | cut -d' ' -f4 | sed 's/=/ /g' | sort -h -k2 -r | uniq -c | sort -h -k1 -r > /tmp/count_freq_150mhz.stats
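For plotting, the resulting `uniq -c` stats can be parsed into pairs with a few lines of Python (the helper name `parse_stats` is illustrative, not from the PR):

```python
def parse_stats(lines):
    # Parse lines like "  57 count 6723" (the `uniq -c` output above)
    # into (steps_per_cmd, occurrences) pairs ready for plotting.
    pairs = []
    for line in lines:
        freq, _, count = line.split()
        pairs.append((int(count), int(freq)))
    return pairs

sample = ["  57 count 6723", "   3 count 6391"]
print(parse_stats(sample))  # [(6723, 57), (6391, 3)]
```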

If I graph steps per cmd vs occurrence frequency, it will look like this:
[graphs: steps per cmd vs occurrence frequency, for 12, 64, and 150 MHz]

Thanks,
-Timofey


The above data was collected under the same circumstances as here: https://klipper.discourse.group/t/improved-stepcompress-implementation/3203/50?u=nefelim4ag
So: batch mode, RP2040, adjusted MCU frequency (12, 64, 150 MHz), and 64 microsteps.

I can collect data for any MCU frequency, any microstepping, or any model if it is necessary.


Btw, technically, the max error is often half of the interval, so it could be used for velocity/knot analysis, but I don't have a good/cheap way to do so right now.

@KevinOConnor
Collaborator

Thanks. I'm a bit confused though. Why would we want to expand the memory size to store a value that is already reliably calculated from existing memory? I feel like I'm missing something - why store this value? What's the end user visible change?

The stepcompress.c queues were optimized for memory size. That is why it plays the weird games with int32_t instead of using the more natural int64_t.

-Kevin

@KevinOConnor
Collaborator

Just to be clear, the queues were optimized for memory - both for overall memory size and for cache-line size. So, for example, if the code has to analyze 4000 steps, that would be 16000 bytes in the current code - which could use a good chunk of the L1 cache on a typical SBC. This change would double the size to 32000 bytes and notably increase the pressure on the L1 cache.
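The arithmetic behind those numbers, as a quick check (the 32 KiB L1 data cache is an assumed typical size, not from the thread):

```python
steps = 4000
old_bytes = steps * 4   # one uint32_t clock per step (current code)
new_bytes = steps * 8   # clock plus a cached max_error (this PR)
l1 = 32 * 1024          # assumed typical SBC L1 data cache size

print(old_bytes, new_bytes)         # 16000 32000
print(round(new_bytes / l1 * 100))  # 98 -- nearly the whole 32 KiB L1
```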

So, I feel like I'm missing something.

Cheers,
-Kevin

@nefelim4ag
Collaborator Author

I feel like I'm missing something - why store this value?

My general logic about those changes is:

  • There is a bisect loop that repeatedly recalculates values.
  • I can't optimize the loop.
  • I can reduce the amount of branching and the fixed cost per step by computing max_error once.
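A schematic Python sketch of that idea (names like Step and queue_step here are illustrative; the real code is C in stepcompress.c): the error bound is computed once when a step is queued, so the compression loop only reads a cached value.

```python
from dataclasses import dataclass

@dataclass
class Step:
    clock: int       # absolute step clock
    max_error: int   # allowed deviation, precomputed once

def queue_step(queue, clock, ext_limit):
    # The bound is mostly static: an external limit, capped by half
    # the inter-step interval (schematic rule, not the exact
    # stepcompress.c formula).
    interval = clock - queue[-1].clock if queue else 2 * ext_limit
    queue.append(Step(clock, min(ext_limit, interval // 2)))

def compress_window(queue):
    # The hot loop now just reads the cached bound instead of
    # branching to recompute it on every iteration.
    return [(s.clock - s.max_error, s.clock) for s in queue]

q = []
queue_step(q, 100, 25)
queue_step(q, 140, 25)
print(compress_window(q))  # [(75, 100), (120, 140)]
```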

What's the end user visible change?

It is too low-level for any user-visible changes.
So, to explain my "Big Picture" view:
I think that any (future) increase in compression precision (reduction of speed jumps) can benefit:

  • IS smoothers
  • Any stepper-related shapers, because they produce "small" output changes.
  • (speculation on my side) Driver's logic, closed loop driver's behaviour.

Any increase in precision will increase the computational cost.
These changes reduce that cost beforehand.

... which could use a good chunk of the L1 cache on a typically SBC. This change would double the size to 32000 and notably increase the pressure on the L1 cache.

I did think about cache hits.
Since we mostly read linearly, prefetching should mostly hide the difference.
At least on my RPi 5 hardware, there is no measurable impact.
Because of resolution limitations, a long compressed sequence seems unlikely, but it is the worst-case scenario for CPU caches.

That said, it is possible that it will behave differently on a weak SBC; I cannot know how different it will be.

If you think this is not worth it, I'm okay with that.

Regards,
-Timofey

@KevinOConnor
Collaborator

Ah, okay. So, it's an optimization to reduce cpu instructions.

For what it is worth, in my experience optimizing for memory cache yields much larger benefits than optimizing for cpu instructions.

If this change can be shown to measurably improve performance then I'm fine with making the change, but I'd be surprised if it notably improves performance. C compilers and modern CPUs are really good at this type of math and small branches.

Cheers,
-Kevin

@KevinOConnor
Collaborator

KevinOConnor commented Feb 7, 2026

That said, introducing struct step as you have done here is probably a good idea. (That is, struct step with just the time member.) It gives the compiler more information (better ability to detect aliases) and would likely make the code a little easier to read.

Cheers,
-Kevin

EDIT: If you decide to do that (no pressure), how about struct qstep with member clock32 for names?

@nefelim4ag
Collaborator Author

I will split it out along with the name refactoring, because I agree it will make it easier to understand what is happening here.
I think I'll adopt your naming suggestion.

Thanks,
-Timofey

This change was initially made to cheapen my repetitive "fit" computations: where they cost +17%, with the cache it is ~15%.
I will try to collect performance data from different SBCs to get an idea of the impact of this specific change.

@nefelim4ag nefelim4ag marked this pull request as draft February 7, 2026 22:09
Max error is mostly a static value,
defined by either an external value or the inter-step time.
Precompute it once to reduce the time cost of each compressor loop.

It works the same as before.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
On every cycle, the last step happens at or before its time.
The new max_error is equal to or larger than before.
Simplify the logic by avoiding re-computing it.
It has an insignificant impact on the compressed output.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
@github-actions

Thank you for your contribution to Klipper. Unfortunately, a reviewer has not assigned themselves to this GitHub Pull Request. All Pull Requests are reviewed before merging, and a reviewer will need to volunteer. Further information is available at: https://www.klipper3d.org/CONTRIBUTING.html

There are some steps that you can take now:

  1. Perform a self-review of your Pull Request by following the steps at: https://www.klipper3d.org/CONTRIBUTING.html#what-to-expect-in-a-review
    If you have completed a self-review, be sure to state the results of that self-review explicitly in the Pull Request comments. A reviewer is more likely to participate if the bulk of a review has already been completed.
  2. Consider opening a topic on the Klipper Discourse server to discuss this work. The Discourse server is a good place to discuss development ideas and to engage users interested in testing. Reviewers are more likely to prioritize Pull Requests with an active community of users.
  3. Consider helping out reviewers by reviewing other Klipper Pull Requests. Taking the time to perform a careful and detailed review of others work is appreciated. Regular contributors are more likely to prioritize the contributions of other regular contributors.

Unfortunately, if a reviewer does not assign themselves to this GitHub Pull Request then it will be automatically closed. If this happens, then it is a good idea to move further discussion to the Klipper Discourse server. Reviewers can reach out on that forum to let you know if they are interested and when they are available.

Best regards,
~ Your friendly GitIssueBot

PS: I'm just an automated script, not a human being.
