meshletutils: Improve computeMeshlet/ClusterBounds performance#1022

Merged
zeux merged 4 commits into master from ml-bounds
Feb 23, 2026

zeux (Owner) commented Feb 21, 2026

This change adjusts the implementation of meshopt_computeMeshletBounds and meshopt_computeClusterBounds to be more efficient.

Instead of copying the corner positions to stack arrays, we use the vertex indices to index the original array directly. This significantly reduces stack usage but slightly regresses performance (by 1-2%); however, it is needed so that the next optimization can be done with reasonable stack space.
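To illustrate the indexed access pattern, here is a minimal sketch that reads each point in place through an index and a byte stride, with no intermediate copy. `Bounds` and `computeBoundsIndexed` are hypothetical names, and an axis-aligned box stands in for the bounding sphere to keep the example short; this is not the library's implementation.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>

// Hypothetical result type: an axis-aligned box stands in for the bounding sphere.
struct Bounds
{
    float min[3];
    float max[3];
};

// Hypothetical helper: bounds of the points selected by `indices`, read directly
// from a strided position buffer instead of being copied to a stack array first.
Bounds computeBoundsIndexed(const float* positions, size_t stride_bytes, const unsigned int* indices, size_t index_count)
{
    assert(stride_bytes >= 3 * sizeof(float) && stride_bytes % sizeof(float) == 0);

    size_t stride = stride_bytes / sizeof(float);
    Bounds b = {{FLT_MAX, FLT_MAX, FLT_MAX}, {-FLT_MAX, -FLT_MAX, -FLT_MAX}};

    for (size_t i = 0; i < index_count; ++i)
    {
        // One multiply per point: this is the indexing overhead the copy-free version pays.
        const float* p = positions + size_t(indices[i]) * stride;

        for (int k = 0; k < 3; ++k)
        {
            if (p[k] < b.min[k]) b.min[k] = p[k];
            if (p[k] > b.max[k]) b.max[k] = p[k];
        }
    }

    return b;
}
```

The only storage needed beyond the inputs is the result itself, which is what keeps the stack footprint independent of the worst-case corner count.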

When computing meshlet bounds, we used to replicate corner positions for each triangle corner; a typical meshlet that has, say, 64V/96T would compute the bounding sphere for an array of 288 positions (96 triangles x 3 corners), despite only 64 of them being unique. With the index inputs, we can pass the meshlet_vertices slice directly to computeBoundingSphere instead. This makes meshopt_computeMeshletBounds 1.5-1.7x faster end-to-end.
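The length of the meshlet_vertices slice is not passed to the function, but it can be recovered from the triangle data. A minimal sketch, assuming meshlet_triangles holds 8-bit local indices into meshlet_vertices and that every local vertex up to the largest index is in use; `meshletVertexCount` is a hypothetical helper name, not a library function:

```cpp
#include <cstddef>

// Hypothetical helper: number of entries of meshlet_vertices referenced by a meshlet,
// derived as (largest local index + 1) from its triangle array.
size_t meshletVertexCount(const unsigned char* meshlet_triangles, size_t triangle_count)
{
    if (triangle_count == 0)
        return 0;

    unsigned int max_index = 0;

    for (size_t i = 0; i < triangle_count * 3; ++i)
        if (meshlet_triangles[i] > max_index)
            max_index = meshlet_triangles[i];

    return size_t(max_index) + 1;
}
```

For the 64V/96T example this scan visits 288 corner bytes once and yields 64, so the bounding sphere is then computed over 64 points instead of 288.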

When computing cluster bounds, we don't have a readily available deduplicated index array. While we could use meshopt_extractMeshletIndices, we don't need a precisely deduplicated array; a best-effort conservative deduplication is sufficient. We can use the same kind of direct-mapped cache as a filter, appending the corner index on every cache miss; in practice this filters out most duplicates at a smaller cost. As a result, meshopt_computeClusterBounds is also 1.5x+ faster.
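The filtering idea can be sketched as follows. This is an illustration under assumptions, not the code from this change: `filterIndices` and `kCacheSize` are hypothetical, the cache size of 256 is arbitrary, and the index value ~0u is assumed never to occur in the input because it is used as the empty-slot marker.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical cache size; must be a power of two so the slot can be picked with a mask.
const size_t kCacheSize = 256;

// Hypothetical helper: writes a conservative superset of the unique indices to `output`
// (which must hold at least index_count entries) and returns how many were written.
// Every distinct index is emitted at least once; some duplicates may slip through.
size_t filterIndices(unsigned int* output, const unsigned int* indices, size_t index_count)
{
    unsigned int cache[kCacheSize];
    memset(cache, 0xff, sizeof(cache)); // all slots start as ~0u, assumed to be an unused index

    size_t count = 0;

    for (size_t i = 0; i < index_count; ++i)
    {
        unsigned int v = indices[i];
        unsigned int& slot = cache[v & (kCacheSize - 1)];

        // Branchless append: always store, but only advance the cursor on a cache miss.
        output[count] = v;
        count += (slot != v);

        // No slow path: the slot simply remembers the most recent index that mapped to it.
        slot = v;
    }

    return count;
}
```

When two indices map to the same slot and alternate, both keep missing and are emitted repeatedly. That is harmless here because the output only feeds a bounding computation, where extra points cost time but do not change the result.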

Finally, the aggregate stack consumption got significantly smaller; previously, meshopt_computeClusterBounds would need ~25 KB stack space, and meshopt_computeMeshletBounds would need ~31 KB. After this change, meshopt_computeMeshletBounds needs ~15 KB stack and meshopt_computeClusterBounds needs ~17 KB. These numbers are inclusive of internal functions and measured on a 64-bit debug build.

This contribution is sponsored by Valve.

zeux added 4 commits February 20, 2026 20:54
Instead of copying the input points to stack arrays, use the provided
indices directly to compute the bounding sphere around all corners.

This is a little more expensive (1-2%) due to extra branches and
multiplication overhead if the function doesn't get inlined, but it
allows us to avoid allocating large worst case size arrays on the stack,
and provides opportunity for further optimizations.

To avoid the dependency on triangle-corner mapping, we now store the
full plane equation (normal+d) for each non-degenerate triangle. Also,
corners of degenerate triangles are now included in the bounding
sphere - which is probably *more* correct if anything, but should not
matter in practice as the degenerate triangles usually share vertices
with non-degenerate ones.
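
As an illustration of what storing "normal+d" involves, here is a minimal sketch that builds a plane from three corners and rejects degenerate triangles. `Plane` and `computeTrianglePlane` are hypothetical names, and the sign convention dot(n, p) + d = 0 is an assumption; the actual code may use a different layout or convention.

```cpp
#include <cmath>

// Hypothetical plane representation: unit normal (nx, ny, nz) and offset d,
// with points p on the plane satisfying dot(n, p) + d = 0 (assumed convention).
struct Plane
{
    float nx, ny, nz, d;
};

// Hypothetical helper: returns false for degenerate (zero-area) triangles, which get no plane.
bool computeTrianglePlane(Plane& out, const float* p0, const float* p1, const float* p2)
{
    float e1[3] = {p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2]};
    float e2[3] = {p2[0] - p0[0], p2[1] - p0[1], p2[2] - p0[2]};

    // The cross product of two edges is the unnormalized triangle normal.
    float nx = e1[1] * e2[2] - e1[2] * e2[1];
    float ny = e1[2] * e2[0] - e1[0] * e2[2];
    float nz = e1[0] * e2[1] - e1[1] * e2[0];

    float length = std::sqrt(nx * nx + ny * ny + nz * nz);
    if (length == 0.f)
        return false;

    nx /= length;
    ny /= length;
    nz /= length;

    // With d stored, later steps no longer need to look up a corner of this triangle.
    out.nx = nx;
    out.ny = ny;
    out.nz = nz;
    out.d = -(nx * p0[0] + ny * p0[1] + nz * p0[2]);
    return true;
}
```
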
Instead of using triangle corners as the source of data for the bounding
sphere, use corner indices when calling meshopt_computeMeshletBounds.
Because our input is a meshlet, the vertices are already easily
available via the meshlet_vertices array; while we don't have the number
of elements, it's easy to compute from the triangle array.

In typical meshlets the number of vertices is 3-4x smaller than the
number of corners, and this makes bounds computation significantly
faster, by 1.5x or faster depending on the cache effects.
When computing cluster bounds from raw index data, we don't have the
meshlet structure; however, we can use a similar cache structure to the
one we use in meshopt_extractMeshletIndices to deduplicate the indices
on the fly.

Because this is simply a performance optimization, it pays off to use a
simpler cache that just tracks the presence of each vertex (not its
position) and has no slow path; if the vertex collides with the previous
one in the cache, we push the potential duplicate to the output.

For additional performance, the append sequence is branchless, as this
branch is difficult to predict; an extra unused element in the output
corners[] array makes it easy to implement.

This makes meshopt_computeClusterBounds ~1.5x faster or more depending
on the cache behavior; the performance gains are similar to the previous
change in meshopt_computeMeshletBounds as the vertex filtering is very
cheap.
Both of the adjustments here were always implied through meshlet/cluster
data construction but were never explicit. We currently are not relying
on the 256 unique vertex index limit, but it might be needed in the
future if the implementation is refined further, so we might as well
note it down.
zeux changed the title from "meshletutils: Improve meshopt_computeMeshlet/ClusterBounds performance" to "meshletutils: Improve computeMeshlet/ClusterBounds performance" on Feb 21, 2026
zeux merged commit 26ef1c8 into master on Feb 23, 2026
13 checks passed
zeux deleted the ml-bounds branch on Feb 23, 2026 17:10