Project RealSlimShader (Single Pass Lighting and Multi Pass Parallel Lighting)

survivor · January 20, 2012, 11:23pm

Update (2012-07-31):

General bugfixes
POM optimized and fixed
Experimental ParallaxDepthCorrection
QDM removed / postponed

See here.

Download:
Repository: Google code
Snapshot: RealSlimShader-2012-07-31.zip.

----

*** previous updates ***
Bugfixes
SpotLight support for all Lighting_* shaders
Improved tests
AccumulationBuffer
MaterialEx with LightingRenderer
MultiPassParallelLightingRenderer + Lighting_MPPLR.vert + Lighting_MPPLR.frag
Dynamic Parallax Occlusion Mapping with Approximate Soft Shadows (LightingSP_POM.j3md)

The switches to play with are:
The settings in Materials/Rock.j3m (preview not always working)
TestSinglePassLighting.SPHERE_SEGMENTS
TestSinglePassLighting.NUM_LIGHTS (if the shader doesn’t compile, set this to 1)

tl;dr? Jump to *** interesting part ***

I’m currently working on a slim single pass lighting system optimized for weak hardware (integrated or just old). It builds on what @androlo posted here. I’m trying to make it as compatible as possible with Lighting.j3md in terms of parameters and rendering output. It will support an arbitrary number of lights by passing the actual number of lights as a #define to the shader at compile time. Another option might be to pass the number of lights as a uniform and #define MAX_LIGHTS, which would prevent re-compilation when the number of lights changes. Anyway, I’ll commit it to google code as soon as it is a little more mature.
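The trade-off between the two options can be sketched in plain Java. This is only an illustration, not the project's code: `ShaderVariants`, its methods and the uniform name `m_NumLights` are hypothetical, and no jME3 API is involved. With an exact `NUM_LIGHTS` define, every new light count produces a new shader source (and therefore a recompile); with a fixed `MAX_LIGHTS`, the source stays the same and only a uniform would change.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: counts how many distinct shader sources each strategy needs. */
final class ShaderVariants {

    /** Option 1: the exact light count is baked into the shader source. */
    static String exactDefine(int numLights) {
        return "#define NUM_LIGHTS " + numLights + "\n";
    }

    /** Option 2: a fixed upper bound is baked in; the real count would be a uniform. */
    static String cappedDefine(int maxLights) {
        return "#define MAX_LIGHTS " + maxLights + "\nuniform int m_NumLights;\n";
    }

    /** Number of distinct preambles (i.e. compilations) for a sequence of light counts. */
    static int compilations(int[] lightCounts, boolean capped, int maxLights) {
        Set<String> sources = new HashSet<>();
        for (int n : lightCounts) {
            sources.add(capped ? cappedDefine(maxLights) : exactDefine(n));
        }
        return sources.size();
    }
}
```

For the light counts 1, 3, 3, 5 the exact define needs three different sources, the capped variant only one.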

I know you guys designed jME3 for high end hardware, but there are a lot of good reasons for me to start working on this. My employer cut the nVidia GPU from my company notebook and jME3 doesn’t run very well on a “Mobile Intel® 4 Series Express Chipset Family”, nor does it on my 7-year-old notebook (ATi Mobility Radeon 9700). Even current netbooks and office PCs run horribly (except FixedFunc, but I don’t want to be stuck with it any longer), although they claim to support shader model >= 2 and DirectXzy.

So I read some docs like “Fast mobile shaders” and tested what really kills those GPUs. The results are in some cases not as expected. The number one killer for mobile and integrated GPUs is said to be multi pass rendering due to their limited bandwidth and shared memory architecture. Lighting.j3md does a render pass for all ambient lights plus one for each non-ambient light. To make it short: Yeah, single pass rendering boosts fps a lot on weak hardware.
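To put numbers on the pass count described above, here is a tiny sketch. `PassCount` is a hypothetical helper and only restates the description in this post (one pass for ambient plus one per non-ambient light); it is not jME3 code.

```java
/** Hypothetical helper, not part of jME3. */
final class PassCount {

    /** Multi pass as described above: one ambient pass plus one pass per non-ambient light. */
    static int multiPass(int nonAmbientLights) {
        return 1 + nonAmbientLights;
    }

    /** Single pass: the geometry is drawn once, no matter how many lights there are. */
    static int singlePass(int nonAmbientLights) {
        return 1;
    }
}
```

For the test scene with three directional lights this means four draws of the sphere versus one.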

Another important thing the Unity3D guys pointed out is the “frequency of computation”. “There are way more pixels than vertices, and way more vertices than objects. This means that you can afford only so much computation per pixel; somewhat more on vertices; and a lot more per-object.” The logical conclusion would be: do as much as possible in the vertex shader (since the code there is executed far less often than the code in the fragment shader) and let the interpolator (varying) work for you. Unfortunately, this is not true for integrated GPUs (especially from Intel). The article “Optimizing for integrated graphics cards” foreshadows what I found out in my experiments.

*** interesting part ***

The results can be seen in this post on Intel dev forum. Here is a picture of the table posted there (can’t insert HTML here) in case something bad happens to that posting.

My test scene is a sphere. The resolution is 640x480 without multisampling. There are three directional lights in the scene.

[Image: ZCmzo.png]

I’ve tested three different systems:
DELL Latitude E6500 Notebook (Core2Duo 2.8 GHz, GMA X4500MHD)
7 year old Nexoc Osiris E604 Notebook (Pentium M 2.0 GHz, ATi Mobility Radeon 9700)
Gaming PC (AMD Phenom II X4 955 3.4 GHz, nVidia GTX 275)

Legend:
The three columns below a GPU mean: fixed-function pipeline / per vertex lighting / per fragment lighting
Values in brackets mean a multi-pass shader (Lighting.j3md), rest single pass
Green indicates a shader that does as little as possible in vertex shader, red makes normal use of the vertex shader

[Image: results table, O4Zgz.png]

You can see that the fps of the Intel GPU roughly halves from sphere segmentation 32 to 33 if the shader does a lot in the vertex unit. This is most probably because the Intel driver chooses to execute the vertex shader on the CPU (SWVP vs. HWVP).

I’m planning to also release a benchmarking program with some test shaders so that people can simply identify what makes rendering slow on their GPU. Maybe we can build a little database to improve shaders for a wide variety of hardware.

normen · January 20, 2012, 11:28pm

Great, sounds cool, gl with this!

Momoko_Fan · January 21, 2012, 5:05am

I have to say I am somewhat surprised. In the past the single pass shader was the primary one and when too many lights were used, the instruction limit was hit on those integrated GPUs (or was it the varying limit?). In any case, it would be interesting to see the shader

nehon · January 21, 2012, 9:14am

Very interesting. This could really benefit lighting on android. I already use per vertex lighting on android because per fragment kills the gpu

But from your table it’s also a benefit for high end GPU.

The question is… what’s the drawback? Nothing comes for free, I’m afraid.

Nice work, digging into that

androlo · January 21, 2012, 12:18pm

Wow very sophisticated. I’ll most definitely use this also for my grass system, because it has only vertex lighting on the grass quads and needs to do a bunch of weird non-light stuff like animation etc.

survivor · January 21, 2012, 2:28pm

You are absolutely right. There are limits. I almost forgot about that. The number of varyings does not scale with the number of lights since I only pass position, normal and texcoord to the fragment shader and do everything else there. The other limits differ from GPU to GPU and are quite interesting to explore.

My old Radeon 9700M suffers the most from the instruction limit. I guess loops are unrolled and functions are inlined on this GPU. The second limit is that uniform arrays like g_LightColor[] can only have up to 128 elements (compiler error otherwise). In addition, every element beyond [39] seems to be vec4(0.0) for some reason. But this limit is hardly relevant because of the instruction limit. At the moment, I can haz 10 light sources and it renders Sphere(256, 256) at 81 fps (Lighting.j3md at 11 fps).

My Intel candidate doesn’t seem to suffer from an instruction limit or there’s no unrolling / inlining. The DevGuide also says “unlimited” for SM 4.0. No idea if that somehow transfers to GLSL120. The problematic limit here is the overall uniform limit. At the moment I can haz 35 lights and it renders Sphere(256, 256) at 14 fps (Lighting.j3md at 0 fps).

To test if there is an instruction limit or a uniform limit, I used a block like this:

[java]
uniform vec4 g_LightPosition[NUM_LIGHTS];
uniform vec4 g_LightColor[NUM_LIGHTS];
uniform vec4 g_LightColor2[NUM_LIGHTS];
uniform vec4 g_LightColor3[NUM_LIGHTS];
//…
for (int i = 0; i < NUM_LIGHTS; i++)
{
    finalColor += doLighting(N, E, g_LightPosition[i], g_LightColor[i]);
    doLighting2(N, E, g_LightPosition[i], g_LightColor[i], finalColor);
    // different index expressions so the compiler cannot merge the accesses
    doLighting2(N, E, g_LightPosition[i], g_LightColor2[NUM_LIGHTS-i-1], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor3[i/2], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor[i/4], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor2[i/8], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor3[i/16], finalColor);
    finalColor += g_LightColor3[NUM_LIGHTS-1];
    // … insert more stuff here to go for the instruction limit
}
[/java]

The GLSL compiler seems to be smart enough to remove unused uniforms. If I remove the “2” and “3” in the function calls, the instruction count stays the same but fewer uniforms are used. I used different indexes to prevent compiler magic.

Finally, the only limit my nVidia GTX 275 seems to have to face is the element limit of 256 for uniform arrays (g_LightPosition[256]). One can circumvent this by simply defining “g_LightPosition2[NUM_LIGHTS - 256]”. I did not encounter any other limit. This GPU is a real beast. At the moment I can haz 35 lights and it renders Sphere(256, 256) at 93 fps (Lighting.j3md at 5 fps).
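The array-splitting workaround can be generated mechanically. The following is a minimal sketch in plain Java; `UniformArraySplitter` is a hypothetical helper and the limit of 256 is simply the value observed on the GTX 275 above, not a general constant.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: emits GLSL uniform declarations that respect a per-array element limit. */
final class UniformArraySplitter {

    /** Splits numLights elements into arrays of at most maxElements each. */
    static List<String> declarations(String baseName, int numLights, int maxElements) {
        if (numLights < 1 || maxElements < 1) {
            throw new IllegalArgumentException("numLights and maxElements must be positive");
        }
        List<String> out = new ArrayList<>();
        int remaining = numLights;
        int part = 1;
        while (remaining > 0) {
            int size = Math.min(remaining, maxElements);
            // First array keeps the plain name, later ones get a numeric suffix (2, 3, ...).
            String name = (part == 1) ? baseName : baseName + part;
            out.add("uniform vec4 " + name + "[" + size + "];");
            remaining -= size;
            part++;
        }
        return out;
    }
}
```

For 300 lights and a limit of 256 this yields `g_LightPosition[256]` and `g_LightPosition2[44]`. The shader loop would still have to pick the right array for each index.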

Let’s see where the journey takes us. At the moment, I have a brain parallax from trying to understand parallax mapping, but it’s just a matter of time to kill this brain bug.

androlo · January 21, 2012, 3:46pm

This sounds like it could be interesting to the devs when they begin working on deferred lighting. There are so many details.

Anyways, I’m no expert on this but if you want to delegate work sometimes, maybe easier stuff, I would not mind helping out at all. Or if you just want some extra data points (got 2 laptops, one with an older Intel HD card, one with a newer Radeon HD card). I read all the links here and will keep updated. You know where to @find me.

survivor · January 22, 2012, 9:42pm

I know it takes me longer than someone who is more experienced, but the time is not wasted. I really learn a lot and it’s a lot of fun!

I have just committed my test project to Google code. A snapshot is also available as download.

The switches to play with are:

The settings in Materials/Rock.j3m (textures included, preview not always working)
TestSinglePassLighting.SPHERE_SEGMENTS
TestSinglePassLighting.NUM_LIGHTS (if the shader doesn’t compile, set this to 1)

The class MaterialSP extends Material, similar to @androlo’s FMaterial. In addition, I set the define NUM_LIGHTS to the actual number of lights in the scene and fill the g_LightColor[NUM_LIGHTS], g_LightPosition[NUM_LIGHTS] and g_LightDirection[NUM_LIGHTS] arrays.
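Roughly, the filling of the arrays looks like the sketch below. This is not the real MaterialSP code: `LightPacker` and `SimpleLight` are hypothetical stand-ins in plain Java, without any jME3 classes, and only show how one vec4 per light ends up in a flat float array next to the `NUM_LIGHTS` define.

```java
/** Hypothetical sketch of packing per-light data into flat vec4 arrays. */
final class LightPacker {

    /** Stand-in for a scene light; not a jME3 class. */
    static final class SimpleLight {
        final float[] color;     // r, g, b, a
        final float[] position;  // x, y, z, w

        SimpleLight(float[] color, float[] position) {
            this.color = color;
            this.position = position;
        }
    }

    /** The define that is handed to the shader at compile time. */
    static String numLightsDefine(SimpleLight[] lights) {
        return "#define NUM_LIGHTS " + lights.length;
    }

    /** Flattens the colors: element i of the vec4 array starts at index 4 * i. */
    static float[] packColors(SimpleLight[] lights) {
        float[] out = new float[lights.length * 4];
        for (int i = 0; i < lights.length; i++) {
            System.arraycopy(lights[i].color, 0, out, i * 4, 4);
        }
        return out;
    }

    /** Same layout for the positions. */
    static float[] packPositions(SimpleLight[] lights) {
        float[] out = new float[lights.length * 4];
        for (int i = 0; i < lights.length; i++) {
            System.arraycopy(lights[i].position, 0, out, i * 4, 4);
        }
        return out;
    }
}
```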

“LightingSP” is not so slim as it used to be, but I think it’s better to have a well structured, readable version to begin with. In fact I had two bugs in my early version that let me have more lights on my old Radeon 9700M than now. The first one was that I didn’t transform the L vector to view space. It’s not noticeable unless you move the camera (which the guy who wrote this tutorial I followed obviously didn’t). The fix brought an additional matrix multiplication for every light which has a great impact on instruction limited GPUs.

The second bug shows how intelligent the compiler is. Take a look at this code:

[java]
// calculate Specular Term:
#if defined(MATERIAL_COLORS) && defined(SPECULAR)
    #define NEED_SPECULAR
#endif
#if defined(NEED_SPECULAR) || defined(SPECULARMAP)
    vec3 R = normalize(-reflect(L, N));
    vec4 Ispec = lightColor * pow(max(dot(R, E), 0.0), m_Shininess);
    #ifdef NEED_SPECULAR
        Ispec *= m_Specular;
    #endif
    #ifdef SPECULARMAP
        // the specular map scales the specular term
        Ispec *= texture2D(m_SpecularMap, texCoord);
    #endif
    Ispec = clamp(Ispec, 0.0, 1.0);
#else
    vec4 Ispec = vec4(0.0);
#endif
[/java]

In line 9, instead of “Ispec *= m_Specular;”, it was “Ispec = m_Specular;”. The compiler was smart enough not to execute the complicated term “vec4 Ispec = lightColor * pow(max(dot(R, E), 0.0), m_Shininess);” above, because it knew this value would be overwritten anyway. So after fixing this bug, too, the light limit on my Radeon is now at 2! This is why there is this “NEED_SPECULAR”. If you have set UseMaterialColors but not defined a specular color, then it will not include the specular code to make room for more light sources. By the way, it might be nice to integrate GLSL Optimizer into jME3.
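The effect of the two preprocessor conditions can be written down as a small truth function. This is just a Java mirror of the `#if` lines in the snippet; `SpecularSwitch` is a hypothetical name and the booleans stand for whether the corresponding define is set.

```java
/** Hypothetical mirror of the preprocessor logic around NEED_SPECULAR. */
final class SpecularSwitch {

    /** Mirrors: #if defined(MATERIAL_COLORS) && defined(SPECULAR) -> #define NEED_SPECULAR */
    static boolean needSpecular(boolean materialColors, boolean specular) {
        return materialColors && specular;
    }

    /** Mirrors: #if defined(NEED_SPECULAR) || defined(SPECULARMAP) */
    static boolean includesSpecularCode(boolean materialColors, boolean specular, boolean specularMap) {
        return needSpecular(materialColors, specular) || specularMap;
    }
}
```

With material colors on, no specular color and no specular map, the specular code is left out and its instructions are free for more lights.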

I still have a bug in normal mapping, although I think I do everything just like “Lighting”. I also noticed a strange thing about “Lighting”. My test program sheds light on 4 different spots. If you increase the number of lights to 200 it will have 50 lights per spot and the intensity is divided by 200 so that the result should not change and it doesn’t with “LightingSP”. With “Lighting”, it looks like this with 20, 40, 120, 250 lights:

[Image: 6KSC6.png]

I guess it has something to do with multi pass.
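One possible explanation, which is only an assumption and not confirmed anywhere in this thread, is rounding in an 8-bit framebuffer: in multi pass, each light's small contribution is quantized before it is blended in, while single pass sums everything in floating point and quantizes once. The sketch below simulates that idea in plain Java; `AccumulationError` is a hypothetical name and the rounding behaviour is assumed, real drivers may differ.

```java
/** Simulation under an assumed 8-bit framebuffer with rounding; not taken from jME3. */
final class AccumulationError {

    /** Single pass: sum all lights in floating point, quantize once on output. */
    static int singlePass(double targetIntensity, int numLights) {
        double sum = 0.0;
        for (int i = 0; i < numLights; i++) {
            sum += targetIntensity / numLights;
        }
        return (int) Math.round(Math.min(sum, 1.0) * 255.0);
    }

    /** Multi pass (assumed): every pass is quantized to 8 bits, then added with clamping. */
    static int multiPass(double targetIntensity, int numLights) {
        int perPass = (int) Math.round(targetIntensity / numLights * 255.0);
        int framebuffer = 0;
        for (int i = 0; i < numLights; i++) {
            framebuffer = Math.min(framebuffer + perPass, 255);
        }
        return framebuffer;
    }
}
```

With a target intensity of 0.4, the single pass path gives 102 for any of the light counts tried here, while the simulated multi pass path gives 100 for 20 lights and 0 for 250 lights. Whether this is what actually happens in “Lighting” would have to be checked.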

I’d be happy to receive feedback and tests on as many different GPUs as possible.

thetoucher · January 22, 2012, 10:07pm

nice work man!

4 lights :

http://i.imgur.com/3Y6rX.jpg

250 lights :

http://i.imgur.com/oRY1I.jpg

… GTX 470 (Win7)

at 400 lights it crapped out with some nasty looking shader errors, the most useful of which is probably :

line 848, column 17: error: out of bounds array access

line 850, column 25: error: offset for relative array access outside supported range

… but I just put that down to user error, who sets 400 lights :roll:

survivor · January 22, 2012, 10:13pm

@thetoucher: Did you increase the SPHERE_SEGMENTS to 256 (Sphere() has a bug above that)? Edit: You didn’t. I can see it from the vertex count.

My GTX 275 bails out at ~256 lights. So I thought [256] would be some hard limit of today’s GPUs. Good to know that it’s not. Ok, a GTX 275 is more like yesterday.

thetoucher · January 22, 2012, 10:27pm

just did …

300 segments

http://i.imgur.com/v9sSZ.jpg

400 segments

http://i.imgur.com/vnraz.jpg

it sort of makes sense though, at and above 256x256 you’re over the 16bit number limit… someone else can figure out where
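A quick check of the 16-bit idea, as a sketch: it assumes that Sphere allocates (zSamples - 2) * (radialSamples + 1) + 2 vertices, which is quoted from memory of the engine source and should be verified, and that the mesh uses unsigned short indices. `SphereIndexCheck` is a hypothetical helper.

```java
/** Hypothetical check of a sphere's vertex count against 16-bit index buffers. */
final class SphereIndexCheck {

    /** Assumed vertex count formula for Sphere(zSamples, radialSamples, ...); verify against the engine. */
    static int vertexCount(int zSamples, int radialSamples) {
        return (zSamples - 2) * (radialSamples + 1) + 2;
    }

    /** Unsigned short indices can address vertices 0..65535, i.e. at most 65536 vertices. */
    static boolean fitsInShortIndices(int vertexCount) {
        return vertexCount <= 65536;
    }
}
```

Under these assumptions 256 segments gives 65280 vertices, which still fits, and 257 segments gives 65792, which does not. That would match the bug showing up above 256.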

pspeed · January 22, 2012, 10:27pm

I’m such a nerd. I saw the “four lights” and this is what I thought:

http://www.youtube.com/watch?v=moX3z2RJAV8

survivor · January 29, 2012, 2:25am

Small update:

After getting a bit “lost in space” I finally managed to get normal mapping working. In an earlier version, I transformed the normal map from tangent space to view space. But since the lights are passed in world space, I have to transform them anyway. So now I do lighting in tangent space, if a normal map is present.

I wonder if I could do lighting in world space. If there are many lights, transforming them per fragment is a huge slow down and because of loop unrolling, it also hits the instruction limit of 64 on older hardware.

I found two very interesting articles about that:

My own lighting shader with blackjack and hookers
Let There Be Light!: A Unified Lighting Technique for a New Generation of Games

Edit: I made a new shader “WorldSpaceLightingSP” which does lighting in world space. It’s ~25% faster at scenes with many lights. There’s also a Sponza test included.

[Image: g6jx0.png]

Now I’ll try to further optimize the shader by using more SIMD commands as described here.

Empire_Phoenix · January 29, 2012, 5:50am

So this is mostly the same as the standard shader, but way faster with multiple lights, do I get this right?

What are the downsides on modern hardware where the instruction limit does not count?

survivor · January 29, 2012, 11:58am

Unfortunately, I lost some of my initial goals on the way. “WorldSpaceLightingSP” is still a bit faster than “Lighting”, but it also still lacks features like attenuation, spot light and parallax mapping. It’s absolutely possible that it will turn out slower in the end, because doing stuff that can be done per vertex is done per fragment to support many lights. On modern, non-integrated hardware with high bandwidth, doing one pass for each light is not much slower than doing unnecessary lighting stuff in fragment shader. Thanks to switching lighting to world space, this unnecessary stuff is just multiplying light color with material color for ambient, diffuse and specular. Maybe I could pass “gl_LightProducts[]” as uniform. I don’t have to transform the lights anymore.
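The light products idea moves the color multiplication from per fragment to per object. A minimal sketch in plain Java, with `LightProducts` as a hypothetical helper and colors as RGBA float arrays; the results would then be passed to the shader as uniforms instead of the raw light colors.

```java
/** Hypothetical CPU-side precomputation of lightColor * materialColor. */
final class LightProducts {

    /** Component-wise product of two RGBA colors. */
    static float[] multiply(float[] a, float[] b) {
        float[] out = new float[4];
        for (int i = 0; i < 4; i++) {
            out[i] = a[i] * b[i];
        }
        return out;
    }

    /** One product per light, computed once per object instead of once per fragment. */
    static float[][] products(float[][] lightColors, float[] materialColor) {
        float[][] out = new float[lightColors.length][];
        for (int i = 0; i < lightColors.length; i++) {
            out[i] = multiply(lightColors[i], materialColor);
        }
        return out;
    }
}
```

The same function would be called three times per object, once each for the ambient, diffuse and specular material colors.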

As I said, I will try to optimize like here using SIMD commands. At the moment, I can only have 3-9 lights on my old Radeon 9700M (instruction limit: 64). 3 = all features enabled, 9 = no textures and no specular light. My plan is to get ambient + 4 or 8 lights (because 4 is the number that can be parallelized via SIMD) working on my old ATI in a single pass and do more lights in additional passes if needed. You see, I’m getting ever closer to “Lighting”.
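The resulting pass count for such a batched scheme is easy to estimate. A small sketch, with `LightBatcher` as a hypothetical helper; it assumes ambient is folded into the first pass, so there is always at least one pass.

```java
/** Hypothetical helper: passes needed when each pass handles a fixed number of lights. */
final class LightBatcher {

    /** Ceiling division, but never less than one pass (ambient goes into the first pass). */
    static int passes(int numLights, int lightsPerPass) {
        if (numLights < 0 || lightsPerPass < 1) {
            throw new IllegalArgumentException("invalid light counts");
        }
        return Math.max(1, (numLights + lightsPerPass - 1) / lightsPerPass);
    }

    /** Number of lights handled in the given pass (0-based). */
    static int lightsInPass(int numLights, int lightsPerPass, int pass) {
        int start = pass * lightsPerPass;
        return Math.max(0, Math.min(lightsPerPass, numLights - start));
    }
}
```

With 4 lights per pass, 10 lights need 3 passes (4 + 4 + 2) instead of the 11 passes of one pass per light plus ambient.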

My Intel GMA suffers the most from multi pass. It also suffers from doing stuff in the vertex shader at all, because it’s done on the CPU and the varyings have to be passed (low bandwidth) to the fragment shader on the GPU. At the moment, “WorldSpaceLightingSP” runs ok on Intel, but it could run better if I did even less stuff in the vertex unit. And multi pass completely kills Intel GMA. So I might end up using a different shader or an “Intel switch” here. That means there has to be some GPU detection.

BTW: I added the filter techniques from “Lighting” to my shader, so SSAO works in the Sponza test.

androlo · January 29, 2012, 12:27pm

Spotlight slows down a lot.

survivor · March 5, 2012, 12:53am

Update:

I’ve implemented Dynamic Parallax Occlusion Mapping with Approximate Soft Shadows (see LightingSP_POM). The shader is parameter compatible to “Lighting.j3md”.

Soft shadows:

http://www.youtube.com/watch?v=y-I3LtwUhxI

Level of detail system:

Adaptive in-shader level-of-detail system implementation. Compute the current mip level explicitly in the pixel shader and use this information to transition between different levels of detail from the full effect to simple bump mapping. See the above paper for more discussion of the approach and its benefits. (see: Tatarchuk-POM-SI3D06.pdf)

http://www.youtube.com/watch?v=B7K1y5ZXRM8

You can see the LoD transition. Maybe one could scale the LoD threshold by angle.
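The LoD decision described above can be sketched in Java. This is a simplified illustration and not the shader from the repository: `PomLod` is a hypothetical name, the screen-space derivatives that the shader gets from dFdx/dFdy are passed in as plain squared lengths (in texels), and the linear blend over the last mip level before the threshold is an assumption.

```java
/** Hypothetical sketch of a mip-level based transition from POM to plain bump mapping. */
final class PomLod {

    /** Mip level from the squared lengths of the texel-space derivatives in x and y. */
    static double mipLevel(double dxLengthSquared, double dyLengthSquared) {
        double maxDelta = Math.max(dxLengthSquared, dyLengthSquared);
        // 0.5 * log2(maxDelta), clamped so that magnified textures stay at level 0
        return Math.max(0.5 * (Math.log(maxDelta) / Math.log(2.0)), 0.0);
    }

    /** 0 = full POM, 1 = plain bump mapping; blends linearly in the last level before the threshold. */
    static double bumpWeight(double mipLevel, int lodThreshold) {
        if (mipLevel > lodThreshold) {
            return 1.0;
        }
        if (mipLevel <= lodThreshold - 1) {
            return 0.0;
        }
        return mipLevel - (lodThreshold - 1);
    }
}
```

Scaling the threshold by viewing angle, as suggested above, would mean making `lodThreshold` a function of the angle instead of a constant.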

~~Point light is implemented in "WorldSpaceLightingSP_SIMD". I have to merge that.~~ Point light with attenuation is implemented. Spot light is not yet implemented. I'm wondering if a multi pass shader that does 4 lights per pass would be good. The single pass shader suffers a bit from the overhead of doing stuff in the fragment unit that can be done in the vertex unit. The break even is somewhere between 2 and 4 lights. And with 4 lights, there are a lot of opportunities to use vector commands (SIMD).

And then there are Prism Parallax Occlusion Mapping with Accurate Silhouette Generation and A Prism-Free Method for Silhouette Rendering in Inverse Displacement Mapping.

thetoucher · March 5, 2012, 12:56am

really nice work . how is performance ?

survivor · March 5, 2012, 1:04am

Not good.

The POM itself (without shadows) is a bit more efficient than steep parallax mapping, but with only one light, Lighting.j3md is a lot faster. With more than one light, the single pass shader becomes faster. Shadows slow it down a lot. I think POM is not meant for bumps this high; those would be better done in geometry. It’s just for demo.

btw: Shadows are independent from POM and can be applied to classic and steep parallax mapping, too. And it’s easy to integrate POM and / or shadows into multi pass Lighting.j3md. Feel free to do that if you like (at jme devs).

atomix · March 5, 2012, 2:09pm

This test should be in the show case of JME3 ! 8)

As it runs at 25 fps, I consider the performance quite playable… Can you tell which parameters affect the performance here (too lazy to read the article), as I just need the effect in a specific scene…