Dynamic Parallax Occlusion Mapping with Approximate Soft Shadows (LightingSP_POM.j3md)
The switches to play with are:
The settings in Materials/Rock.j3m (preview not always working)
TestSinglePassLighting.NUM_LIGHTS (if the shader doesn’t compile, set this to 1)
tl;dr? Jump to *** interesting part ***
I’m currently working on a slim single pass lighting system optimized for weak hardware (integrated or just old). It builds on what @androlo posted here. I’m trying to make it as compatible as possible with Lighting.j3md in terms of parameters and rendering output. It will support an arbitrary number of lights by passing the actual number of lights to the shader as a #define at compile time. An alternative might be to pass the number of lights as a uniform and #define MAX_LIGHTS instead, to prevent re-compilation whenever the number of lights changes. Anyway, I’ll commit it to Google Code as soon as it is a little more mature.
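A minimal sketch of the uniform-based variant, assuming GLSL 1.20. The names m_NumLights and vNormal and the value of MAX_LIGHTS are hypothetical, and the lights are treated as plain directional lights; this is not the actual LightingSP code.

```glsl
#version 120
// Hypothetical upper bound, fixed at compile time.
#define MAX_LIGHTS 8

uniform int m_NumLights;                   // assumed name: lights actually in use
uniform vec4 g_LightColor[MAX_LIGHTS];
uniform vec4 g_LightPosition[MAX_LIGHTS];  // assumption: xyz = direction towards the light

varying vec3 vNormal;                      // assumed to be in the same space as the lights

void main() {
    vec3 N = normalize(vNormal);
    vec4 finalColor = vec4(0.0);
    // The loop bound is a compile-time constant; the uniform only cuts it short.
    for (int i = 0; i < MAX_LIGHTS; i++) {
        if (i >= m_NumLights) {
            break;
        }
        float NdotL = max(dot(N, normalize(g_LightPosition[i].xyz)), 0.0);
        finalColor += g_LightColor[i] * NdotL;
    }
    gl_FragColor = finalColor;
}
```

The trade-off: the shader no longer needs to be recompiled when a light is added, but a GPU that unrolls loops may still pay the instruction cost of all MAX_LIGHTS iterations.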
I know you guys designed jME3 for high end hardware, but there are a lot of good reasons for me to start working on this. My employer cut the nVidia GPU from my company notebook and jME3 doesn’t run very well on a “Mobile Intel® 4 Series Express Chipset Family”, nor on my 7 year old notebook (ATi Mobility Radeon 9700). Even current netbooks and office PCs run it horribly (except FixedFunc, but I don’t want to be stuck with it any longer), although they claim to support shader model >= 2 and DirectXzy.
So I read some docs like “Fast mobile shaders” and tested what really kills those GPUs. The results are in some cases not as expected. The number one killer for mobile and integrated GPUs is said to be multi pass rendering due to their limited bandwidth and shared memory architecture. Lighting.j3md does a render pass for all ambient lights plus one for each non-ambient light. To make it short: Yeah, single pass rendering boosts fps a lot on weak hardware.
Another important thing the Unity3D guys pointed out is the “frequency of computation”. “There are way more pixels than vertices, and way more vertices than objects. This means that you can afford only so much computation per pixel; somewhat more on vertices; and a lot more per-object.” The logical conclusion would be: Do as much as possible in the vertex shader (since the code there is executed a lot less often than the code in the fragment shader) and let the interpolator (varying) work for you. Unfortunately, this is not true for integrated GPUs (especially from Intel). The article “Optimizing for integrated graphics cards” foreshadows what I found out in my experiments.
*** interesting part ***
The results can be seen in this post on Intel dev forum. Here is a picture of the table posted there (can’t insert HTML here) in case something bad happens to that posting.
My test scene is a sphere. The resolution is 640x480 without multisampling. There are three directional lights in the scene.
You can see that the fps of the Intel GPU roughly halves from sphere segmentation 32 to 33 if the shader does a lot in the vertex unit. This is most probably because the Intel driver chooses to execute the vertex shader on the CPU (SWVP vs. HWVP).
I’m planning to also release a benchmarking program with some test shaders so that people can simply identify what makes rendering slow on their GPU. Maybe we can build a little database to improve shaders for a wide variety of hardware.
I have to say I am somewhat surprised. In the past the single pass shader was the primary one and when too many lights were used, the instruction limit was hit on those integrated GPUs (or was it the varying limit?). In any case, it would be interesting to see the shader.
Wow very sophisticated. I’ll most definitely use this also for my grass system, because it has only vertex lighting on the grass quads and needs to do a bunch of weird non-light stuff like animation etc.
You are absolutely right. There are limits. I almost forgot about that. The number of varyings does not scale with the number of lights since I only pass position, normal and texcoord to the fragment shader and do everything else there. The other limits differ from GPU to GPU and are quite interesting to explore.
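To illustrate why the varying count stays constant, here is a sketch of a vertex shader that passes on only position, normal and texture coordinate. The uniform and attribute names follow the conventions of jME3’s stock shaders as far as I know them; the varying names are made up, and this is not the actual LightingSP vertex shader.

```glsl
#version 120
uniform mat4 g_WorldViewProjectionMatrix;
uniform mat4 g_WorldViewMatrix;
uniform mat3 g_NormalMatrix;

attribute vec3 inPosition;
attribute vec3 inNormal;
attribute vec2 inTexCoord;

// Three varyings, no matter how many lights NUM_LIGHTS asks for.
varying vec3 vPosition;   // view-space position
varying vec3 vNormal;     // view-space normal
varying vec2 vTexCoord;

void main() {
    vec4 pos = vec4(inPosition, 1.0);
    vPosition = (g_WorldViewMatrix * pos).xyz;
    vNormal = g_NormalMatrix * inNormal;
    vTexCoord = inTexCoord;
    gl_Position = g_WorldViewProjectionMatrix * pos;
}
```

All per-light work then happens in the fragment shader, which reads the light arrays directly as uniforms.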
My old Radeon 9700M suffers the most from the instruction limit. I guess loops are unrolled and functions are inlined on this GPU. The second limit is that uniform arrays like g_LightColor can only have up to 128 elements (compiler error otherwise). In addition, every element beyond  seems to be vec4(0.0) for some reason. But this limit is hardly relevant because of the instruction limit. At the moment, I can haz 10 light sources and it renders Sphere(256, 256) at 81 fps (Lighting.j3md at 11 fps).
My Intel candidate doesn’t seem to suffer from an instruction limit or there’s no unrolling / inlining. The DevGuide also says “unlimited” for SM 4.0. No idea if that somehow transfers to GLSL120. The problematic limit here is the overall uniform limit. At the moment I can haz 35 lights and it renders Sphere(256, 256) at 14 fps (Lighting.j3md at 0 fps).
To test if there is an instruction limit or a uniform limit, I used a block like this:
uniform vec4 g_LightPosition[NUM_LIGHTS];
uniform vec4 g_LightColor[NUM_LIGHTS];
uniform vec4 g_LightColor2[NUM_LIGHTS];
uniform vec4 g_LightColor3[NUM_LIGHTS];

for (int i = 0; i < NUM_LIGHTS; i++) {
    finalColor += doLighting(N, E, g_LightPosition[i], g_LightColor[i]);
    doLighting2(N, E, g_LightPosition[i], g_LightColor[i], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor2[NUM_LIGHTS-i-1], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor3[i/2], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor[i/4], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor2[i/8], finalColor);
    doLighting2(N, E, g_LightPosition[i], g_LightColor3[i/16], finalColor);
}
finalColor += g_LightColor3[NUM_LIGHTS-1];
// … insert more stuff here to go for the instruction limit
The GLSL compiler seems to be smart enough to remove unused uniforms. If I remove the “2” and “3” in the function call, the instruction count stays the same but fewer uniforms are used. I used different indexes to prevent compiler magic.
Finally, the only limit my nVidia GTX 275 seems to have to face is the element limit of 256 for uniform arrays (g_LightPosition). One can circumvent this by simply defining “g_LightPosition2[NUM_LIGHTS - 256]”. I did not encounter any other limit. This GPU is a real beast. At the moment I can haz 35 lights and it renders Sphere(256, 256) at 93 fps (Lighting.j3md at 5 fps).
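The array split could look roughly like this. It is a sketch under assumptions: ARRAY_LIMIT is the 256-element limit observed on the GTX 275, the helper lightPosition() and the varying vNormal are hypothetical, and NUM_LIGHTS = 300 is just an example value. The preprocessor guard avoids declaring an array of size zero or less when NUM_LIGHTS fits into a single array.

```glsl
#version 120
#define NUM_LIGHTS 300    // example value above the observed limit
#define ARRAY_LIMIT 256   // per-array element limit observed on the GTX 275

#if NUM_LIGHTS > ARRAY_LIMIT
uniform vec4 g_LightPosition[ARRAY_LIMIT];
uniform vec4 g_LightPosition2[NUM_LIGHTS - ARRAY_LIMIT];
#else
uniform vec4 g_LightPosition[NUM_LIGHTS];
#endif

// Hypothetical helper: hides the split behind a single index.
vec4 lightPosition(int i) {
#if NUM_LIGHTS > ARRAY_LIMIT
    if (i >= ARRAY_LIMIT) {
        return g_LightPosition2[i - ARRAY_LIMIT];
    }
#endif
    return g_LightPosition[i];
}

varying vec3 vNormal;

void main() {
    vec3 N = normalize(vNormal);
    float sum = 0.0;
    for (int i = 0; i < NUM_LIGHTS; i++) {
        // Treats every entry as a direction towards the light (assumption).
        sum += max(dot(N, normalize(lightPosition(i).xyz)), 0.0);
    }
    gl_FragColor = vec4(vec3(sum / float(NUM_LIGHTS)), 1.0);
}
```

Whether this compiles on a given GPU still depends on its overall uniform limit, which is the limit reported above for the Intel chipset.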
Let’s see where the journey takes us. At the moment, I have a brain parallax from trying to understand parallax mapping, but it’s just a matter of time to kill this brain bug.
This sounds like it could be interesting to the devs when they begin working on deferred lighting. There are so many details.
Anyways, I’m no expert on this but if you want to delegate work sometimes, maybe easier stuff, I would not mind helping out at all. Or if you just want some extra data points (got 2 laptops, one with an older Intel HD card, one with a newer Radeon HD card). I read all the links here and will keep updated. You know where to @find me.
The settings in Materials/Rock.j3m (textures included, preview not always working)
TestSinglePassLighting.NUM_LIGHTS (if the shader doesn’t compile, set this to 1)
The class MaterialSP extends Material, similar to @androlo’s FMaterial. In addition, I set the define NUM_LIGHTS to the actual number of lights in the scene and fill the g_LightColor[NUM_LIGHTS], g_LightPosition[NUM_LIGHTS] and g_LightDirection[NUM_LIGHTS] arrays.
“LightingSP” is not so slim as it used to be, but I think it’s better to have a well structured, readable version to begin with. In fact I had two bugs in my early version that let me have more lights on my old Radeon 9700M than now. The first one was that I didn’t transform the L vector to view space. It’s not noticeable unless you move the camera (which the guy who wrote this tutorial I followed obviously didn’t). The fix brought an additional matrix multiplication for every light which has a great impact on instruction limited GPUs.
The second bug shows how intelligent the compiler is. Take a look at this code:
In line 9, instead of “Ispec *= m_Specular;”, it was “Ispec = m_Specular;”. The compiler was smart enough to skip the expensive term “vec4 Ispec = lightColor * pow(max(dot(R, E), 0.0), m_Shininess);” above, because it knew this value would be overwritten anyway. So after fixing this bug, too, the light limit on my Radeon is now at 2! This is why there is this “NEED_SPECULAR” define. If you have set UseMaterialColors but not defined a specular color, the shader will not include the specular code, to make room for more light sources. By the way, it might be nice to integrate GLSL Optimizer into jME3.
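A sketch of how a NEED_SPECULAR guard might look, assuming GLSL 1.20 and a single directional light for brevity. This is not the original listing: the function shade(), the varyings and the use of m_Diffuse and g_LightDirection are assumptions, and only the two specular lines are taken from the text above.

```glsl
#version 120
// Would normally be set by the material definition; defined here for illustration.
#define NEED_SPECULAR

uniform vec4 m_Diffuse;
#ifdef NEED_SPECULAR
uniform vec4 m_Specular;
uniform float m_Shininess;
#endif
uniform vec4 g_LightColor;      // one light only, to keep the sketch short
uniform vec4 g_LightDirection;  // assumption: xyz = view-space direction towards the light

varying vec3 vPosition;         // view-space position
varying vec3 vNormal;           // view-space normal

// Hypothetical helper, not the doLighting() of the post.
vec4 shade(vec3 N, vec3 E, vec3 L, vec4 lightColor) {
    vec4 Idiff = lightColor * m_Diffuse * max(dot(N, L), 0.0);
#ifdef NEED_SPECULAR
    vec3 R = reflect(-L, N);
    vec4 Ispec = lightColor * pow(max(dot(R, E), 0.0), m_Shininess);
    Ispec *= m_Specular;  // with "=" the compiler may drop the pow() term above
    return Idiff + Ispec;
#else
    return Idiff;         // specular code is not compiled in at all
#endif
}

void main() {
    vec3 N = normalize(vNormal);
    vec3 E = normalize(-vPosition);  // the eye sits at the origin in view space
    vec3 L = normalize(g_LightDirection.xyz);
    gl_FragColor = shade(N, E, L, g_LightColor);
}
```

Removing the #define drops the reflect() and pow() instructions for every light, which is where the extra room on instruction-limited GPUs comes from.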
I still have a bug in normal mapping, although I think I do everything just like “Lighting”. I also noticed a strange thing about “Lighting”. My test program sheds light on 4 different spots. If you increase the number of lights to 200 it will have 50 lights per spot and the intensity is divided by 200 so that the result should not change and it doesn’t with “LightingSP”. With “Lighting”, it looks like this with 20, 40, 120, 250 lights:
After getting a bit “lost in space” I finally managed to get normal mapping working. In an earlier version, I transformed the normal map from tangent space to view space. But since the lights are passed in world space, I have to transform them anyway. So now I do lighting in tangent space, if a normal map is present.
I wonder if I could do lighting in world space. If there are many lights, transforming them per fragment is a huge slowdown and, because of loop unrolling, it also hits the instruction limit of 64 on older hardware.
Unfortunately, I lost some of my initial goals on the way. “WorldSpaceLightingSP” is still a bit faster than “Lighting”, but it also still lacks features like attenuation, spot light and parallax mapping. It’s absolutely possible that it will turn out slower in the end, because doing stuff that can be done per vertex is done per fragment to support many lights. On modern, non-integrated hardware with high bandwidth, doing one pass for each light is not much slower than doing unnecessary lighting stuff in fragment shader. Thanks to switching lighting to world space, this unnecessary stuff is just multiplying light color with material color for ambient, diffuse and specular. Maybe I could pass “gl_LightProducts” as uniform. I don’t have to transform the lights anymore.
As I said, I will try to optimize like here using SIMD commands. At the moment, I can only have 3-9 lights on my old Radeon 9700M (instruction limit: 64). 3 = all features enabled, 9 = no textures and no specular light. My plan is to get ambient + 4 or 8 lights (because 4 is the number that can be parallelized via SIMD) working on my old ATI in a single pass and do more lights in additional passes if needed. You see, I’m getting ever closer to “Lighting”.
My Intel GMA suffers the most from multi pass. It also suffers from doing stuff in the vertex shader at all, because it’s done on the CPU and the varyings have to be passed (low bandwidth) to the fragment shader on the GPU. At the moment, “WorldSpaceLightingSP” runs ok on Intel, but it could run better if I did even less stuff in the vertex unit. And multi pass completely kills Intel GMA. So I might end up using a different shader or an “Intel switch” here. That means there has to be some GPU detection.
BTW: I added the filter techniques from “Lighting” to my shader, so SSAO works in the Sponza test.
Adaptive in-shader level-of-detail system implementation. Compute the current mip level explicitly in the pixel shader and use this information to transition between different levels of detail from the full effect to simple bump mapping. See the above paper for more discussion of the approach and its benefits. (see: Tatarchuk-POM-SI3D06.pdf)
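The mip level computation described in that paper can be sketched in GLSL 1.20 as follows. The names m_TextureSize, m_LodThreshold and pomOffset() are hypothetical, and the parallax ray march is replaced by a stub, so this only shows the level-of-detail fade and is not the code of LightingSP_POM.

```glsl
#version 120
uniform sampler2D m_DiffuseMap;
uniform vec2 m_TextureSize;    // hypothetical: texture dimensions in texels
uniform float m_LodThreshold;  // hypothetical: mip level from which the parallax effect is off

varying vec2 vTexCoord;

// Estimates the mip level from the screen-space derivatives of the
// texture coordinate, measured in texels.
float mipLevel(vec2 texCoord) {
    vec2 texels = texCoord * m_TextureSize;
    vec2 dx = dFdx(texels);
    vec2 dy = dFdy(texels);
    float delta = max(dot(dx, dx), dot(dy, dy));
    // Guard against log2(0.0), which is undefined.
    return max(0.5 * log2(max(delta, 1e-8)), 0.0);
}

// Stub: a real shader would ray-march the height map here.
vec2 pomOffset(vec2 texCoord) {
    return vec2(0.0);
}

void main() {
    float mip = mipLevel(vTexCoord);
    // 0.0 = full effect, 1.0 = no parallax offset; the fade spans the
    // last mip level before the threshold.
    float fade = clamp(mip - (m_LodThreshold - 1.0), 0.0, 1.0);
    vec2 coord = vTexCoord + pomOffset(vTexCoord) * (1.0 - fade);
    gl_FragColor = texture2D(m_DiffuseMap, coord);
}
```

A full implementation would also skip the ray march entirely once the mip level exceeds the threshold, which is where the performance gain comes from.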
You can see the LoD transition. Maybe one could scale the LoD threshold by angle.
Point light is implemented in "WorldSpaceLightingSP_SIMD". I have to merge that. Point light with attenuation is implemented. Spot light is not yet implemented. I'm wondering if a multi pass shader that does 4 lights per pass would be good. The single pass shader suffers a bit from the overhead of doing stuff in the fragment unit that can be done in the vertex unit. The break even is somewhere between 2 and 4 lights. And with 4 lights, there are a lot of opportunities to use vector commands (SIMD).
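One way to use vector commands for 4 lights, shown as a sketch and not as the code of “WorldSpaceLightingSP_SIMD”: store the directions of four directional lights component-wise, so that all four N·L terms come out of three vector multiply-adds. The uniforms m_LightDirX, m_LightDirY, m_LightDirZ and the varying vNormal are hypothetical, and the directions are assumed to be normalized and in the same space as the normal.

```glsl
#version 120
// Directions towards four directional lights, stored component-wise:
// m_LightDirX holds the x components of lights 0..3, and so on.
uniform vec4 m_LightDirX;
uniform vec4 m_LightDirY;
uniform vec4 m_LightDirZ;
uniform vec4 g_LightColor[4];

varying vec3 vNormal;

void main() {
    vec3 N = normalize(vNormal);

    // Four dot products at once: component k of NdotL belongs to light k.
    vec4 NdotL = m_LightDirX * N.x + m_LightDirY * N.y + m_LightDirZ * N.z;
    NdotL = max(NdotL, vec4(0.0));

    // Weight each light color by its diffuse term and sum them up.
    gl_FragColor = g_LightColor[0] * NdotL.x
                 + g_LightColor[1] * NdotL.y
                 + g_LightColor[2] * NdotL.z
                 + g_LightColor[3] * NdotL.w;
}
```

With fewer than four lights in a pass, the unused slots can be filled with a black light color so they contribute nothing.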
The POM itself (without shadows) is a bit more efficient than steep parallax mapping, but with only one light, Lighting.j3md is a lot faster. With more than one light, the single pass shader becomes faster. Shadows slow things down a lot. I think POM is not meant for bumps this high; those would be better done in geometry. It’s just for the demo.
btw: Shadows are independent from pom and can be applied to classic and steep parallax mapping, too. And it’s easy to integrate pom and / or shadows into multi pass Lighting.j3md. Feel free to do that if you like (at jme devs).
As it runs at 25 fps, I consider the performance quite playable… Can you tell me which parameters affect the performance here (too lazy to read the article)? I just need the effect in a specific scene…