Instead of copying the input points to stack arrays, use the provided indices directly to compute the bounding sphere around all corners. This is a little more expensive (1-2%) due to extra branches and multiplication overhead if the function doesn't get inlined, but it lets us avoid allocating large worst-case-sized arrays on the stack, and provides an opportunity for further optimizations. To avoid the dependency on triangle-corner mapping, we now store the full plane equation (normal+d) for each non-degenerate triangle. Also, corners of degenerate triangles are now included in the bounding sphere - which is probably *more* correct if anything, but should not matter in practice as degenerate triangles usually share vertices with non-degenerate ones.
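As an illustration of storing the plane equation per non-degenerate triangle while reading positions through indices, here is a minimal sketch. It is not the library's code: the `Plane` struct, the `computeTrianglePlane` helper, the tightly packed float3 layout and the sign convention `n.p + d = 0` are all assumptions made for this example.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical storage for a triangle plane: unit normal plus offset d,
// using the convention dot(n, p) + d == 0 for points p on the plane (assumption).
struct Plane
{
    float nx, ny, nz, d;
};

// Reads the three corners through their indices instead of copying them first.
// positions is assumed to be a tightly packed array of float3.
// Returns false for degenerate (zero-area) triangles and leaves `out` untouched.
bool computeTrianglePlane(const float* positions, unsigned int a, unsigned int b, unsigned int c, Plane& out)
{
    const float* p0 = positions + 3 * std::size_t(a);
    const float* p1 = positions + 3 * std::size_t(b);
    const float* p2 = positions + 3 * std::size_t(c);

    float e1[3] = {p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2]};
    float e2[3] = {p2[0] - p0[0], p2[1] - p0[1], p2[2] - p0[2]};

    // Cross product of the two edges gives the (unnormalized) normal.
    float nx = e1[1] * e2[2] - e1[2] * e2[1];
    float ny = e1[2] * e2[0] - e1[0] * e2[2];
    float nz = e1[0] * e2[1] - e1[1] * e2[0];

    float len = std::sqrt(nx * nx + ny * ny + nz * nz);
    if (len == 0.f)
        return false; // degenerate: no plane is stored

    out.nx = nx / len;
    out.ny = ny / len;
    out.nz = nz / len;
    out.d = -(out.nx * p0[0] + out.ny * p0[1] + out.nz * p0[2]);
    return true;
}
```

Once the plane is stored, later stages no longer need to know which corners belonged to which triangle, so the set of points fed to the bounding sphere can change independently.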
Instead of using triangle corners as the source of data for the bounding sphere, use corner indices when calling meshopt_computeMeshletBounds. Because our input is a meshlet, the vertices are already easily available via the meshlet_vertices array; while we don't have the number of elements, it's easy to compute from the triangle array. In typical meshlets the number of vertices is 3-4x smaller than the number of corners, and this makes bounds computation significantly faster, by 1.5x or more depending on cache effects.
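One way to recover the vertex count from the triangle array is to take the largest local index and add one. The sketch below assumes that meshlet triangles are stored as `unsigned char` local indices into `meshlet_vertices` and that local indices are dense (every value below the maximum is used); `meshletVertexCount` is a hypothetical helper, not a library function.

```cpp
#include <cstddef>

// Hypothetical helper: derives the number of entries in meshlet_vertices
// from the local triangle indices, assuming local indices are dense.
std::size_t meshletVertexCount(const unsigned char* meshlet_triangles, std::size_t triangle_count)
{
    if (triangle_count == 0)
        return 0;

    unsigned int max_index = 0;
    for (std::size_t i = 0; i < triangle_count * 3; ++i)
        if (meshlet_triangles[i] > max_index)
            max_index = meshlet_triangles[i];

    return std::size_t(max_index) + 1;
}
```

With that count, the bounding sphere can be computed over `meshlet_vertices[0..count)` rather than over all `triangle_count * 3` corners.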
When computing cluster bounds from raw index data, we don't have the meshlet structure; however, we can use a cache structure similar to the one in meshopt_extractMeshletIndices to deduplicate the indices on the fly. Because this is simply a performance optimization, it pays off to use a simpler cache that just tracks the presence of each vertex (not its position) and has no slow path; if the vertex collides with the previous one in the cache, we push the potential duplicate to the output. For additional performance, the append sequence is branchless, as this branch is difficult to predict; an extra unused element in the output corners[] array makes it easy to implement. This makes meshopt_computeClusterBounds ~1.5x faster or more depending on cache behavior; the performance gains are similar to the previous change in meshopt_computeMeshletBounds as the vertex filtering is very cheap.
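The filtering idea can be sketched as a direct-mapped presence cache with an unconditional store and a conditional advance. This is a simplified illustration rather than the actual implementation: the name `filterIndices`, the cache size of 256 and the use of `~0u` as the empty marker are assumptions, and the sketch requires that no input index equals `~0u`.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical best-effort deduplication filter.
// Writes a superset of the unique indices to `out` and returns how many were written.
// `out` must have room for index_count entries; in this formulation the
// unconditional store never lands past index_count - 1.
std::size_t filterIndices(unsigned int* out, const unsigned int* indices, std::size_t index_count)
{
    const std::size_t kCacheSize = 256; // assumed size; must be a power of two
    unsigned int cache[kCacheSize];
    std::memset(cache, 0xff, sizeof(cache)); // every slot starts as ~0u (empty)

    std::size_t count = 0;
    for (std::size_t i = 0; i < index_count; ++i)
    {
        unsigned int v = indices[i];
        unsigned int& slot = cache[v & (kCacheSize - 1)];

        out[count] = v;       // always store, no branch
        count += (slot != v); // keep the stored value only on a cache miss
        slot = v;             // evict whatever occupied the slot
    }

    return count;
}
```

For the input `{0, 1, 2, 2, 1, 3, 0, 256, 0}` this produces `{0, 1, 2, 3, 256, 0}`: index 256 evicts index 0 from the shared slot, so the final 0 is emitted again. The duplicate is harmless for a bounding sphere, which is why the filter does not need a slow path.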
Both of the adjustments here were always implied by meshlet/cluster data construction but were never explicit. We do not currently rely on the 256 unique vertex index limit, but it might be needed in the future if the implementation is refined further, so we might as well note it down.
This change adjusts the implementation of `meshopt_computeMeshletBounds` and `meshopt_computeClusterBounds` to be more efficient.
Instead of copying the corner positions to stack arrays, we use the vertex indices to index the original array directly. This significantly reduces stack usage, but slightly regresses performance - however, it's important to be able to do the next optimization with reasonable stack space.
When computing meshlet bounds, we used to replicate corner positions for each triangle corner; a typical meshlet that has, say, 64V/96T, would compute the bounding sphere for an array of 288 positions, despite only 64 of them being unique. With the index inputs, we can directly pass the `meshlet_vertices` slice to `computeBoundingSphere` instead. This makes `meshopt_computeMeshletBounds` 1.5-1.7x faster end-to-end.
When computing cluster bounds, we don't have a readily available deduplicated index array. While we could use `meshopt_extractMeshletIndices`, we don't need the precisely deduplicated array, and a best-effort conservative deduplication is sufficient. We can use the same direct mapped cache as a filter, but append a corner index on all misses in the cache; in practice this ends up filtering most duplicates at a smaller cost. As a result, `meshopt_computeClusterBounds` is also 1.5x+ faster.
Finally, the aggregate stack consumption got significantly smaller; previously, `meshopt_computeClusterBounds` would need ~25 KB stack space, and `meshopt_computeMeshletBounds` would need ~31 KB. After this change, `meshopt_computeMeshletBounds` needs ~15 KB stack and `meshopt_computeClusterBounds` needs ~17 KB. These numbers are inclusive of internal functions and measured on a 64-bit debug build.
This contribution is sponsored by Valve.