meshletutils: Improve computeMeshlet/ClusterBounds performance by zeux · Pull Request #1022 · zeux/meshoptimizer

zeux · 2026-02-21T05:28:35Z

This change adjusts the implementation of meshopt_computeMeshletBounds and meshopt_computeClusterBounds to be more efficient.

Instead of copying the corner positions to stack arrays, we use the vertex indices to index the original array directly. This significantly reduces stack usage, but slightly regresses performance - however, it's important to be able to do the next optimization with reasonable stack space.
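As a rough illustration of reading corner positions through indices instead of copying them first, the sketch below runs a Ritter-style approximate bounding sphere over an indexed, strided position array. The names `boundingSphereIndexed` and `Sphere`, and the stride expressed in floats, are assumptions made for this example; this is not the library's internal `computeBoundingSphere`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Sphere
{
    float center[3];
    float radius;
};

// Hypothetical helper (not meshoptimizer's internal code): Ritter-style approximate
// bounding sphere over the points selected by `indices`. Positions are read in place,
// so no per-corner copy is made. `stride_floats` is the distance between consecutive
// positions measured in floats (an assumed layout for this sketch).
Sphere boundingSphereIndexed(const float* positions, size_t stride_floats, const unsigned int* indices, size_t count)
{
    assert(stride_floats >= 3);

    Sphere s = {{0.f, 0.f, 0.f}, 0.f};
    if (count == 0)
        return s;

    // find the extreme points along each axis; pmin/pmax hold offsets into `indices`
    size_t pmin[3] = {0, 0, 0};
    size_t pmax[3] = {0, 0, 0};

    for (size_t i = 0; i < count; ++i)
    {
        const float* p = positions + indices[i] * stride_floats;

        for (int k = 0; k < 3; ++k)
        {
            if (p[k] < positions[indices[pmin[k]] * stride_floats + k]) pmin[k] = i;
            if (p[k] > positions[indices[pmax[k]] * stride_floats + k]) pmax[k] = i;
        }
    }

    // pick the axis whose two extreme points are furthest apart
    float best = 0.f;
    int axis = 0;

    for (int k = 0; k < 3; ++k)
    {
        const float* a = positions + indices[pmin[k]] * stride_floats;
        const float* b = positions + indices[pmax[k]] * stride_floats;
        float d2 = (b[0] - a[0]) * (b[0] - a[0]) + (b[1] - a[1]) * (b[1] - a[1]) + (b[2] - a[2]) * (b[2] - a[2]);

        if (d2 > best)
        {
            best = d2;
            axis = k;
        }
    }

    // initial sphere spans the chosen pair of points
    const float* a = positions + indices[pmin[axis]] * stride_floats;
    const float* b = positions + indices[pmax[axis]] * stride_floats;

    for (int k = 0; k < 3; ++k)
        s.center[k] = (a[k] + b[k]) * 0.5f;
    s.radius = std::sqrt(best) * 0.5f;

    // grow the sphere to include every point that still falls outside
    for (size_t i = 0; i < count; ++i)
    {
        const float* p = positions + indices[i] * stride_floats;
        float d2 = (p[0] - s.center[0]) * (p[0] - s.center[0]) + (p[1] - s.center[1]) * (p[1] - s.center[1]) + (p[2] - s.center[2]) * (p[2] - s.center[2]);

        if (d2 > s.radius * s.radius)
        {
            float d = std::sqrt(d2);
            float k = 0.5f + (s.radius / d) * 0.5f;

            for (int j = 0; j < 3; ++j)
                s.center[j] = s.center[j] * k + p[j] * (1.f - k);
            s.radius = (s.radius + d) * 0.5f;
        }
    }

    return s;
}
```

The only stack storage here is a handful of scalars, independent of the corner count; the cost is one extra index multiplication per position read.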

When computing meshlet bounds, we used to replicate corner positions for each triangle corner; a typical meshlet that has, say, 64V/96T, would compute the bounding sphere for an array of 288 positions, despite only 64 of them being unique. With the index inputs, we can directly pass the meshlet_vertices slice to computeBoundingSphere instead. This makes meshopt_computeMeshletBounds 1.5-1.7x faster end-to-end.

When computing cluster bounds, we don't have a readily available deduplicated index array. While we could use meshopt_extractMeshletIndices, we don't need a precisely deduplicated array; a best-effort conservative deduplication is sufficient. We can use the same kind of direct-mapped cache as a filter, but append the corner index on every cache miss; in practice this filters out most duplicates at a smaller cost. As a result, meshopt_computeClusterBounds is also 1.5x+ faster.

Finally, the aggregate stack consumption got significantly smaller: meshopt_computeMeshletBounds drops from ~31 KB to ~15 KB of stack space, and meshopt_computeClusterBounds drops from ~25 KB to ~17 KB. These numbers are inclusive of internal functions and measured on a 64-bit debug build.

This contribution is sponsored by Valve.

Instead of copying the input points to stack arrays, use the provided indices directly to compute the bounding sphere around all corners. This is a little more expensive (1-2%) due to extra branches and multiplication overhead if the function doesn't get inlined, but it allows us to avoid allocating large worst-case-size arrays on the stack, and provides an opportunity for further optimizations. To avoid the dependency on triangle-corner mapping, we now store the full plane equation (normal+d) for each non-degenerate triangle. Also, corners of degenerate triangles are now included in the bounding sphere - which is probably *more* correct if anything, but should not matter in practice as the degenerate triangles usually share vertices with non-degenerate ones.
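A minimal sketch of what storing a full plane equation per non-degenerate triangle could look like, assuming the convention that points on the plane satisfy `nx*x + ny*y + nz*z + d = 0`. The `Plane` struct, the `trianglePlane` name and the float-count stride are hypothetical and chosen for illustration only. Once `d` is stored, a later pass can evaluate signed distance to the triangle's plane without looking up any of its corners.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Plane
{
    float nx, ny, nz; // unit normal
    float d;          // offset; assumed convention: dot(n, p) + d == 0 on the plane
};

// Hypothetical helper (not meshoptimizer's code): computes the plane of triangle (a, b, c),
// reading positions in place via indices. Returns false for degenerate (zero-area) triangles,
// in which case `result` is left untouched.
bool trianglePlane(const float* positions, size_t stride_floats, unsigned int a, unsigned int b, unsigned int c, Plane& result)
{
    assert(stride_floats >= 3);

    const float* p0 = positions + a * stride_floats;
    const float* p1 = positions + b * stride_floats;
    const float* p2 = positions + c * stride_floats;

    // triangle edges sharing p0
    float e1[3] = {p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2]};
    float e2[3] = {p2[0] - p0[0], p2[1] - p0[1], p2[2] - p0[2]};

    // unnormalized normal = cross(e1, e2); its length is twice the triangle area
    float nx = e1[1] * e2[2] - e1[2] * e2[1];
    float ny = e1[2] * e2[0] - e1[0] * e2[2];
    float nz = e1[0] * e2[1] - e1[1] * e2[0];

    float length = std::sqrt(nx * nx + ny * ny + nz * nz);
    if (length == 0.f)
        return false;

    result.nx = nx / length;
    result.ny = ny / length;
    result.nz = nz / length;
    result.d = -(result.nx * p0[0] + result.ny * p0[1] + result.nz * p0[2]);
    return true;
}
```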

Instead of using triangle corners as the source of data for the bounding sphere, use corner indices when calling meshopt_computeMeshletBounds. Because our input is a meshlet, the vertices are already easily available via the meshlet_vertices array; while we don't have the number of elements, it's easy to compute from the triangle array. In typical meshlets the number of vertices is 3-4x smaller than the number of corners, and this makes bounds computation significantly faster, by 1.5x or more depending on the cache effects.
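One plausible way to recover the vertex count from the triangle array, sketched under the assumption that the local indices in `meshlet_triangles` densely cover `0..count-1`, is to take the largest local index plus one. The helper name `meshletVertexCount` is hypothetical, and the actual implementation may compute this differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the meshoptimizer API): derives the number of entries
// used in meshlet_vertices from the meshlet's local triangle indices.
// Assumes every local index in [0, count) is referenced by at least one triangle.
size_t meshletVertexCount(const unsigned char* meshlet_triangles, size_t triangle_count)
{
    assert(meshlet_triangles || triangle_count == 0);

    if (triangle_count == 0)
        return 0;

    unsigned int max_index = 0;

    for (size_t i = 0; i < triangle_count * 3; ++i)
        if (meshlet_triangles[i] > max_index)
            max_index = meshlet_triangles[i];

    return size_t(max_index) + 1;
}
```

For two triangles `{0, 1, 2}` and `{2, 1, 3}` this yields 4 vertices for 6 corners, so the bounding sphere would be computed over 4 positions rather than 6.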

When computing cluster bounds from raw index data, we don't have the meshlet structure; however, we can use a similar cache structure to the one we use in meshopt_extractMeshletIndices to deduplicate the indices on the fly. Because this is simply a performance optimization, it pays off to use a simpler cache that just tracks the presence of each vertex (not its position) and has no slow path; if the vertex collides with the previous one in the cache, we push the potential duplicate to the output. For additional performance, the append sequence is branchless, as this branch is difficult to predict; an extra unused element in the output corners[] array makes it easy to implement. This makes meshopt_computeClusterBounds ~1.5x faster or more depending on the cache behavior; the performance gains are similar to the previous change in meshopt_computeMeshletBounds as the vertex filtering is very cheap.
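The filtering idea can be sketched roughly as follows, assuming a 256-entry direct-mapped cache keyed by the low bits of the vertex index and `~0u` as the empty marker. The function name `filterCorners`, the cache size and the sentinel are assumptions for this example and are not taken from the actual implementation, which may structure the branchless append differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch (not meshoptimizer's code): best-effort, conservative deduplication
// of an index stream. Every distinct index is emitted at least once; duplicates are usually
// dropped, but may be emitted again after a cache collision.
// `corners` must have room for index_count entries. Assumes ~0u is never a valid index.
size_t filterCorners(const unsigned int* indices, size_t index_count, unsigned int* corners)
{
    assert(corners || index_count == 0);

    const size_t kCacheSize = 256; // assumed size; must be a power of two

    unsigned int cache[kCacheSize];
    memset(cache, 0xff, sizeof(cache)); // every slot starts as ~0u (empty)

    size_t count = 0;

    for (size_t i = 0; i < index_count; ++i)
    {
        unsigned int v = indices[i];
        unsigned int& slot = cache[v & (kCacheSize - 1)];

        // branchless append: always store, only advance the write cursor on a miss
        corners[count] = v;
        count += (slot != v);

        // remember presence only; there is no slow path to resolve collisions
        slot = v;
    }

    return count;
}
```

For the input `{0, 1, 2, 2, 1, 3, 0, 256, 0}` the sketch returns 6 corners, `{0, 1, 2, 3, 256, 0}`: indices 0 and 256 share a cache slot, so the final 0 is appended again. The result stays correct for bounds computation because extra duplicates only cost time.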

Both of the adjustments here were always implied through meshlet/cluster data construction but were never explicit. We are currently not relying on the 256 unique vertex index limit, but it might be needed in the future if the implementation is refined further, so we might as well note it down.

zeux added 4 commits February 20, 2026 20:54

zeux changed the title ~~meshletutils: Improve meshopt_computeMeshlet/ClusterBounds performance~~ meshletutils: Improve computeMeshlet/ClusterBounds performance Feb 21, 2026

zeux merged commit 26ef1c8 into master Feb 23, 2026
13 checks passed

zeux deleted the ml-bounds branch February 23, 2026 17:10
