WebGPU Forward+ and Clustered Deferred Shading

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4

Jinxiang Wang
Tested on: Windows 11, AMD Ryzen 9 8945HS w/ Radeon 780M Graphics 4.00 GHz 32GB, RTX 4070 Laptop 8 GB

Live Demo

Demo Video/GIF

ClusterDemo.mp4

Features Implemented

Part 1:

Naive Pipeline
Clustered Forward+ Pipeline
Clustered Deferred Pipeline

Part 2:

G-buffer Optimization
a. Combined all G-buffers into a single 4-channel rgba32uint buffer. One channel is unused.
b. Replaced full screen vertex + fragment rendering pass with a compute pass
Compute pass bloom
a. Completed bright pixel extraction and 2-pass Gaussian blur.
b. The filtered color is not yet added back to the framebuffer, because the way to do this in WebGPU was unclear.

Scene Spec

All tests were rendered with the following configuration:

Model	Resolution
Sponza	3412x1906

Naive Pipeline

The naive pipeline helps in understanding the basic structure of this project. It completes rendering in a single render pass, in which every light in the light set is evaluated for every fragment. This performs very poorly when the scene contains many dynamic light sources.

Light Count	Naive (ms)
100	29
300	80
500	120
700	182
1000	260
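
The per-fragment cost of the naive approach can be illustrated with a minimal CPU-side sketch. It is not the project's shader: the `PointLight` shape, the Lambert term, and the linear `falloff` function are assumptions chosen only to show that the loop visits every light, whether or not it reaches the fragment.

```typescript
type Vec3 = [number, number, number];

// Hypothetical light layout, for illustration only.
interface PointLight {
  position: Vec3;
  color: Vec3;
  radius: number;
}

// Assumed linear falloff: 1 at the light, 0 at its radius and beyond.
function falloff(distance: number, radius: number): number {
  return Math.max(0, 1 - distance / radius);
}

// Naive shading: every light is evaluated for the fragment, in range or not.
function shadeNaive(fragPos: Vec3, normal: Vec3, lights: PointLight[]): Vec3 {
  const out: Vec3 = [0, 0, 0];
  for (const light of lights) {
    const toLight: Vec3 = [
      light.position[0] - fragPos[0],
      light.position[1] - fragPos[1],
      light.position[2] - fragPos[2],
    ];
    const dist = Math.hypot(toLight[0], toLight[1], toLight[2]);
    if (dist === 0) continue; // avoid dividing by zero for a coincident light
    const nDotL = Math.max(
      0,
      (normal[0] * toLight[0] + normal[1] * toLight[1] + normal[2] * toLight[2]) / dist,
    );
    const weight = nDotL * falloff(dist, light.radius);
    out[0] += light.color[0] * weight;
    out[1] += light.color[1] * weight;
    out[2] += light.color[2] * weight;
  }
  return out;
}
```

The work per fragment grows linearly with the number of lights, which is consistent with the roughly linear growth of the frame times in the table above.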

Clustered Forward+

Clustered Forward+ improves on the traditional forward pipeline. The biggest difference is that it uses a compute pass to perform tile-based light culling. It also uses pre-z, an additional render pass that populates the depth buffer, to reduce overdraw.

The picture above illustrates how a "cluster" works in this context. We divide the view frustum into small clusters based on pixel position and depth.

Then, for each cluster, we loop through the light set and record the indices of the lights that will contribute to its final color. This is achieved by testing the cluster's AABB against each light's radius of influence.
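
The AABB-versus-light test described above can be sketched as a sphere-box intersection: clamp the light position to the box, then compare the squared distance with the squared radius. This is a minimal CPU-side sketch; the type names, the `lightsForCluster` helper, and the `maxLights` cap are hypothetical, and the project performs this step in a compute shader.

```typescript
type Vec3 = [number, number, number];

interface Aabb {
  min: Vec3;
  max: Vec3;
}

interface PointLight {
  position: Vec3;
  radius: number;
}

// Squared distance from point p to the closest point on the box.
function sqDistPointAabb(p: Vec3, box: Aabb): number {
  let d2 = 0;
  for (let i = 0; i < 3; i++) {
    const closest = Math.min(Math.max(p[i], box.min[i]), box.max[i]);
    d2 += (p[i] - closest) * (p[i] - closest);
  }
  return d2;
}

// Collect the indices of lights whose sphere of influence touches the cluster.
// maxLights is an assumed per-cluster cap; extra lights are dropped.
function lightsForCluster(box: Aabb, lights: PointLight[], maxLights: number): number[] {
  const indices: number[] = [];
  for (let i = 0; i < lights.length && indices.length < maxLights; i++) {
    const light = lights[i];
    if (sqDistPointAabb(light.position, box) <= light.radius * light.radius) {
      indices.push(i);
    }
  }
  return indices;
}
```

Comparing squared distances avoids a square root per light per cluster.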

Finally, in the last render pass, instead of looping through the whole light set, we evaluate only the lights assigned to the fragment's cluster. For each valid fragment, i.e. one that is not discarded by the pre-z depth test, we find its 1-D tile index as follows:

find its x and y tile indices from its fragCoord (pixel coordinate) and the tile size;
reconstruct its view-space position to get its depth, normalize it with depth = (rawDepth - nclip) / (fclip - nclip), and derive its z tile index from that;
tileIdx = z * numTilesX * numTilesY + y * numTilesX + x.
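
The three steps above can be sketched as a single function. This is an illustration under stated assumptions, not the project's shader code: the `ClusterGrid` fields, the square `tileSizePx`, and the uniform slicing of the normalized depth into `numTilesZ` slices are all hypothetical choices.

```typescript
// Hypothetical grid description; field names are for illustration only.
interface ClusterGrid {
  numTilesX: number;
  numTilesY: number;
  numTilesZ: number;
  tileSizePx: number; // assumed square screen-space tiles
  nclip: number; // near clip distance
  fclip: number; // far clip distance
}

function clampInt(v: number, lo: number, hi: number): number {
  return Math.min(Math.max(v, lo), hi);
}

// viewDepth is the positive view-space distance of the fragment from the camera.
function clusterIndex(fragX: number, fragY: number, viewDepth: number, grid: ClusterGrid): number {
  // Step 1: x and y tile indices from the pixel coordinate.
  const x = clampInt(Math.floor(fragX / grid.tileSizePx), 0, grid.numTilesX - 1);
  const y = clampInt(Math.floor(fragY / grid.tileSizePx), 0, grid.numTilesY - 1);

  // Step 2: normalize depth to [0, 1] and pick a z slice (uniform slicing assumed).
  const depth = (viewDepth - grid.nclip) / (grid.fclip - grid.nclip);
  const z = clampInt(Math.floor(depth * grid.numTilesZ), 0, grid.numTilesZ - 1);

  // Step 3: flatten (x, y, z) into a 1-D index.
  return z * grid.numTilesX * grid.numTilesY + y * grid.numTilesX + x;
}
```

The clamps keep fragments on the far plane or the last pixel row from indexing one past the end of the cluster buffer.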

Tile Index	Light Count (max = 1000)

Light Count	Naive (ms)	Forward+ (ms)
100	29	9
300	80	21
500	120	30
700	182	43
1000	260	56

The improvement is noticeable: Forward+ is roughly 3x to 4.6x faster than the naive pipeline across the tested light counts.

Clustered Deferred

A traditional deferred pipeline addresses two issues: overdraw and the time complexity of lighting. However, when a scene contains a large set of lights, memory bandwidth becomes the bottleneck. Reading a huge number of lights from a storage buffer for every fragment is very expensive.

Applying the same concept as in the method above, we can divide the frustum into clusters and reduce the amount of lighting computation per fragment. Unlike the Forward+ pipeline, which adds a pre-z pass, the deferred pipeline involves only one vertex-heavy render pass. In the measurements below, the clustered deferred pipeline matches the clustered forward pipeline at 100 lights and is faster at higher light counts.

Albedo	Normal	World Position

Light Count	Naive (ms)	Forward+ (ms)	Clustered Deferred (ms)
100	29	9	9
300	80	21	14
500	120	30	20
700	182	43	29
1000	260	56	36

Optimized Deferred

We can optimize the clustered deferred pipeline further. Using multiple G-buffers is not always feasible: it requires several screen-sized textures and several texture reads per pixel, so texture memory and memory bandwidth remain a concern.

Looking at our G-buffers, we see that they can be packed into a single four-channel texture with 32-bit unsigned integer channels (three channels would suffice, but WebGPU offers no three-channel 32-bit texture format) using the following layout:

x channel: 16 bits for normal.x, 16 bits for normal.y
y channel: 16 bits for normal.z, 16 bits for depth
z channel: 8 bits per albedo channel (24 bits for rgb, 32 bits with alpha)
w channel: spare channel for roughness, metallic, etc.
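
The layout above can be sketched as plain bit packing. The encodings are assumptions, since the text only gives bit widths: this sketch stores the normal as signed normalized 16-bit values, depth as unsigned normalized 16 bits in [0, 1], and albedo as four unsigned normalized 8-bit values. The function names are hypothetical.

```typescript
type Vec3 = [number, number, number];
type Rgba = [number, number, number, number];

const clamp = (v: number, lo: number, hi: number): number => Math.min(Math.max(v, lo), hi);

// [-1, 1] -> 16-bit two's complement (assumed snorm encoding).
const packSnorm16 = (v: number): number => Math.round(clamp(v, -1, 1) * 32767) & 0xffff;
// Sign-extend the low 16 bits, then rescale to [-1, 1].
const unpackSnorm16 = (bits: number): number => ((bits << 16) >> 16) / 32767;
// [0, 1] -> unsigned 16-bit and 8-bit integers (assumed unorm encoding).
const packUnorm16 = (v: number): number => Math.round(clamp(v, 0, 1) * 65535);
const packUnorm8 = (v: number): number => Math.round(clamp(v, 0, 1) * 255);

// Pack one G-buffer texel into four 32-bit channels (x, y, z, w).
function packGBuffer(normal: Vec3, depth01: number, albedo: Rgba): Uint32Array {
  const x = ((packSnorm16(normal[0]) << 16) | packSnorm16(normal[1])) >>> 0;
  const y = ((packSnorm16(normal[2]) << 16) | packUnorm16(depth01)) >>> 0;
  const z =
    ((packUnorm8(albedo[0]) << 24) |
      (packUnorm8(albedo[1]) << 16) |
      (packUnorm8(albedo[2]) << 8) |
      packUnorm8(albedo[3])) >>> 0;
  return new Uint32Array([x, y, z, 0]); // w is the spare channel
}

function unpackGBuffer(texel: Uint32Array): { normal: Vec3; depth01: number; albedo: Rgba } {
  const x = texel[0];
  const y = texel[1];
  const z = texel[2];
  return {
    normal: [unpackSnorm16(x >>> 16), unpackSnorm16(x & 0xffff), unpackSnorm16(y >>> 16)],
    depth01: (y & 0xffff) / 65535,
    albedo: [(z >>> 24) / 255, ((z >>> 16) & 0xff) / 255, ((z >>> 8) & 0xff) / 255, (z & 0xff) / 255],
  };
}
```

In a shader, WGSL built-ins such as `pack2x16snorm` and `pack4x8unorm` cover the same packing. Storing depth rather than a full position means the lighting pass has to reconstruct the position from depth and the pixel coordinate.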

G-buffer	Clustered Deferred	Optimized Deferred
Albedo (bgra8unorm)	26 MB	0 MB
Normal (rgba16float)	52 MB	0 MB
Position (rgba16float)	52 MB	0 MB
Unified (rgba32uint)	0 MB	104 MB
Total	130 MB	104 MB
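
The figures in this table follow from the render resolution and the byte size of each texel. The sketch below reproduces the arithmetic, assuming decimal megabytes (1 MB = 10^6 bytes); the `textureMB` helper is hypothetical.

```typescript
// Bytes per texel for the formats listed in the table.
const bytesPerTexel: Record<string, number> = {
  bgra8unorm: 4, // 4 channels x 8 bits
  rgba16float: 8, // 4 channels x 16 bits
  rgba32uint: 16, // 4 channels x 32 bits
};

// Size of one screen-sized texture in decimal megabytes.
function textureMB(width: number, height: number, format: string): number {
  return (width * height * bytesPerTexel[format]) / 1e6;
}

const width = 3412;
const height = 1906;

// Clustered deferred: albedo + normal + position.
const separateMB =
  textureMB(width, height, "bgra8unorm") +
  textureMB(width, height, "rgba16float") +
  textureMB(width, height, "rgba16float");

// Optimized deferred: one packed texture.
const packedMB = textureMB(width, height, "rgba32uint");

console.log(separateMB.toFixed(1), packedMB.toFixed(1));
```

The saving comes from replacing three textures that total 20 bytes per pixel with one texture of 16 bytes per pixel.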

	Clustered Deferred	Optimized Deferred
G-buffer texture reads per pixel	3	1

Another optimization is to substitute the final full-screen render pass with a compute pass. This gives more flexibility in the data structures used for final shading and leaves more room for optimizing parallelism.
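
When a full-screen pass becomes a compute pass, the dispatch has to cover every pixel. A minimal sketch of the workgroup-count calculation follows; the 8x8 workgroup size and the `dispatchSize` helper are assumptions, not values taken from the project.

```typescript
// Number of workgroups needed to cover a width x height image, rounding up so
// that partially filled workgroups at the right and bottom edges are included.
function dispatchSize(
  width: number,
  height: number,
  workgroupX: number = 8, // assumed workgroup size
  workgroupY: number = 8,
): [number, number] {
  return [Math.ceil(width / workgroupX), Math.ceil(height / workgroupY)];
}

// For the test resolution used in this README.
const [groupsX, groupsY] = dispatchSize(3412, 1906);
console.log(groupsX, groupsY);

// With a real GPUComputePassEncoder named pass, these counts would be passed
// to pass.dispatchWorkgroups(groupsX, groupsY).
```

Because the counts are rounded up, some invocations fall outside the image, so the shader needs an early return for coordinates beyond the texture size.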

Light Count	Naive (ms)	Forward+ (ms)	Clustered Deferred (ms)	Optimized Deferred (ms)
100	29	9	9	7
300	80	21	14	15
500	120	30	20	19
700	182	43	29	24
1000	260	56	36	32

Bloom

A general bloom effect consists of the following operations:

Extract pixels with high luminance
Downsample the bloom texture and blur it
Upsample and add the result back to the original color

For the first part, thanks to hdmmY, I managed to extract the high-luminance pixels while avoiding potential pixel flickering and edge cut-off.

However, the second and third parts proved ambiguous and frustrating to implement. WebGPU does not support hardware-level mipmap generation, so an additional compute or render pass is needed to achieve it. WebGPU also does not yet support read-write storage textures within a single compute pass, so an additional intermediate texture and compute pass must be created.
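
The bright-pixel extraction and the two-pass Gaussian blur can be sketched on the CPU as follows. This is a simplified illustration, not the project's compute shaders: it uses a hard luminance threshold with Rec. 709 weights, a single-channel image for the blur, and clamped edges, all of which are assumptions. The separate `src` and `dst` buffers play the role of the intermediate texture mentioned above.

```typescript
type Rgb = [number, number, number];

// Relative luminance with Rec. 709 weights (assumed choice).
const luminance = (p: Rgb): number => 0.2126 * p[0] + 0.7152 * p[1] + 0.0722 * p[2];

// Keep pixels brighter than the threshold, zero out the rest.
function extractBright(pixels: Rgb[], threshold: number): Rgb[] {
  return pixels.map((p): Rgb => (luminance(p) > threshold ? p : [0, 0, 0]));
}

// Normalized 1-D Gaussian kernel with 2 * radius + 1 taps.
function gaussianWeights(radius: number, sigma: number): number[] {
  const weights: number[] = [];
  let sum = 0;
  for (let i = -radius; i <= radius; i++) {
    const w = Math.exp(-(i * i) / (2 * sigma * sigma));
    weights.push(w);
    sum += w;
  }
  return weights.map((w) => w / sum);
}

// One blur pass over a row-major image: (dx, dy) = (1, 0) is horizontal,
// (0, 1) is vertical. Reads src and writes a separate dst buffer.
function blurPass(
  src: Float32Array, width: number, height: number,
  weights: number[], dx: number, dy: number,
): Float32Array {
  const radius = (weights.length - 1) / 2;
  const dst = new Float32Array(src.length);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let acc = 0;
      for (let k = -radius; k <= radius; k++) {
        const sx = Math.min(Math.max(x + k * dx, 0), width - 1); // clamp to edge
        const sy = Math.min(Math.max(y + k * dy, 0), height - 1);
        acc += weights[k + radius] * src[sy * width + sx];
      }
      dst[y * width + x] = acc;
    }
  }
  return dst;
}

// Two-pass separable blur: horizontal pass, then vertical pass.
function gaussianBlur(src: Float32Array, width: number, height: number, radius: number, sigma: number): Float32Array {
  const weights = gaussianWeights(radius, sigma);
  return blurPass(blurPass(src, width, height, weights, 1, 0), width, height, weights, 0, 1);
}
```

Splitting the blur into two 1-D passes reads 2 * (2r + 1) samples per pixel instead of (2r + 1)^2 for a full 2-D kernel.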

2-passGaussian.mp4

Performance Analysis

Is one of them faster?

Optimized Deferred is the fastest at most tested light counts. The exception is 300 lights, where Clustered Deferred is 1 ms ahead (14 ms vs. 15 ms).

Is one of them better at certain types of workloads?

When the workload is light, the naive pipeline can be the most efficient, as no complicated pass switching, texture binding, etc. is needed.
When the workload is heavy, in most cases we can stick to optimized deferred, if the hardware permits.
When multiple materials and different shading models are required, we should choose Forward+ instead.

What are the benefits and tradeoffs of using one over the other?

Naive:
Benefits: simple to implement
Tradeoffs: poor performance in scenes with many lights

Forward+:
Benefits: supports MSAA and different materials
Tradeoffs: one additional vertex-heavy render pass (pre-z)

Clustered Deferred:
Benefits: less overdraw
Tradeoffs: high texture memory and memory bandwidth requirements, no MSAA support, no transparent material rendering

Optimized Deferred:
Benefits: lower texture memory and memory bandwidth requirements
Tradeoffs: same as clustered deferred
