Poor performance with fragment discarding

Hello everyone,

I’ve got a fairly strange performance problem when discarding fragments in a PBRLighting fork. The geometry the problem occurs with is a stack of 16 large quads in close proximity. I’m discarding fragments based on a displacement map to simulate the height of the material, much like how shell-based fur is done.

When the camera is very close to the geometry, rendering it takes roughly 0.5ms, but when the camera gets further away it takes around 20ms, even though the geometry still takes up a good portion of screen space. When I comment out the discard statement in the shader, the time instead ranges from 0.15ms to 1ms.

No lights, probes, or other meshes are present in the scene. My shader is set up to do gbuffer writes, but that is currently disabled.

#import "Common/ShaderLib/GLSLCompat.glsllib"

#if defined(DIFFUSE_GBUFFER) && defined(NORMALS_GBUFFER)
    #define GBUFFER_WRITE 1
#endif

// enable apis and import PBRLightingUtils
#define ENABLE_PBRLightingUtils_getWorldPosition 1
//#define ENABLE_PBRLightingUtils_getLocalPosition 1
#define ENABLE_PBRLightingUtils_getWorldNormal 1
#define ENABLE_PBRLightingUtils_getWorldTangent 1
#define ENABLE_PBRLightingUtils_getTexCoord 1
#define ENABLE_PBRLightingUtils_readPBRSurface 1
#ifndef GBUFFER_WRITE
    #define ENABLE_PBRLightingUtils_computeDirectLightContribution 1
    #define ENABLE_PBRLightingUtils_computeProbesContribution 1
#endif

#import "Common/ShaderLib/module/pbrlighting/PBRLightingUtils.glsllib"
#import "RenthylPlus/ShaderLib/GBuffers/PBRCompactModel.glsllib"

#ifdef DEBUG_VALUES_MODE
    uniform int m_DebugValuesMode;
#endif

uniform vec3 g_CameraPosition;

#ifdef USE_FOG
    #import "Common/ShaderLib/MaterialFog.glsllib"
#endif

uniform sampler2D m_DisplacementMap;
uniform vec2 m_DisplacementRange;
uniform int m_NumSlices;
uniform float m_StackHeight;

varying float sliceLayer;

// remaps value from [fromMin, fromMax] to [0, 1] (unclamped)
float mapRange(float value, float fromMin, float fromMax) {
    return (value - fromMin) / (fromMax - fromMin);
}

#ifndef GBUFFER_WRITE
    uniform vec4 g_LightData[NB_LIGHTS];

    void computeLighting(inout PBRSurface surface) {
        // Calculate necessary variables from the PBR surface prior to applying lighting. Ensure all texture/param reading and blending occurs before this is called!
        //PBRLightingUtils_calculatePreLightingValues(surface);

        // Calculate direct lights
        for (int i = 0; i < NB_LIGHTS; i += 3) {
            vec4 lightData0 = g_LightData[i];
            vec4 lightData1 = g_LightData[i + 1];
            vec4 lightData2 = g_LightData[i + 2];
            PBRLightingUtils_computeDirectLightContribution(
                lightData0, lightData1, lightData2,
                surface
            );
        }

        // Calculate env probes
        PBRLightingUtils_computeProbesContribution(surface);

        // Put it all together
        gl_FragColor.rgb = vec3(0.0);
        gl_FragColor.rgb += surface.bakedLightContribution;
        gl_FragColor.rgb += surface.directLightContribution;
        gl_FragColor.rgb += surface.envLightContribution;
        gl_FragColor.rgb += surface.emission;
        gl_FragColor.a = surface.alpha; // this line seems to cost about 4ms

        #ifdef USE_FOG
            gl_FragColor = MaterialFog_calculateFogColor(vec4(gl_FragColor));
        #endif

        // outputs a debug visualization selected by m_DebugValuesMode
        #ifdef DEBUG_VALUES_MODE
            gl_FragColor = PBRLightingUtils_getColorOutputForDebugMode(m_DebugValuesMode, vec4(gl_FragColor.rgba), surface);
        #endif
    }
#endif

void main() {

    // discard layer fragments
    float height = texture2D(m_DisplacementMap, texCoord).r;
    height = mapRange(height, m_DisplacementRange.x, m_DisplacementRange.y);
    if (sliceLayer >= 0.0 && sliceLayer > height) {
        discard; // this line seems to cost about 14ms
    }

    vec3 wpos = PBRLightingUtils_getWorldPosition();
    vec3 worldViewDir = normalize(g_CameraPosition - wpos);

    // Create a blank PBRSurface.
    PBRSurface surface = PBRLightingUtils_createPBRSurface(worldViewDir);

    // Read surface data from standard PBR matParams. (note: matParams are declared in 'PBRLighting.j3md' and initialized as uniforms in 'PBRLightingUtils.glsllib')
    PBRLightingUtils_readPBRSurface(surface);

    #ifdef GBUFFER_WRITE
        GBufferWrite_writeSurfaceToGBuffers(surface);
    #else
        computeLighting(surface);
    #endif

    // visualize top and bottom layers for tuning displacement range
    #ifdef LAYER_USAGE_DEBUG
        if (sliceLayer >= 1.0) {
            gl_FragColor = vec4(0.0, 1.0, 0.0, 1.0);
        } else if (sliceLayer <= 0.0) {
            gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
        }
    #endif

}

It’s also worth noting that commenting out gl_FragColor.a = surface.alpha gives back roughly 4ms with discarding enabled, even though surface.alpha is always 1.0. With both discarding and alpha writing disabled, I get 0.1ms to 0.8ms.

Edit: rendering only 1 quad rather than 16 makes the problem less noticeable with discarding and alpha writing enabled (only 0.15ms to 1.9ms), but the rendering time still rises too much with distance. Disabling both discarding and alpha writing with only 1 quad gives roughly the same times as with 16 quads (0.1ms to 0.8ms).

Edit: I swapped out my custom material for the standard PBRLighting material with the same material parameters (where possible), and I get a constant 0.09ms render time with one quad, and 1.5ms with 16 quads, regardless of the camera distance.

So, just to make sure I’m interpreting this last edit correctly.

Regular PBR lighting performs pretty well in all cases… but your custom shader doesn’t? (Which isn’t to cast aspersions against your shader, just trying to make sure we understand where to look.)

From reading a long time ago, I understand there are cases where discard can kill performance… but I never ran into them personally (yet) and so don’t remember specifics.

Curious what your effect looks like… just because it sounds cool. 🙂

The drop from 0.5ms to 20ms already sounds suspicious if the amount of fragments processed is similar. I would investigate that in isolation first.

If the amount is similar enough, you could check the geometry queue to see whether you have to switch shaders a few times when the geometry is further away.

I assume you have checked that the shader is not somehow being constantly recompiled.

Yes, my shader is performing significantly worse than PBRLighting.

You’re right. After doing a little research and trying a bunch of things out, I think my performance problem is caused by poor pixel quad utilization when using discard. GPUs shade fragments in 2x2 quads, and a quad costs roughly the same whether one fragment survives or all four do. When the camera is close and looking directly at the mesh, my discarding algorithm leaves most pixel quads either fully discarded or fully visible, so it runs fast. As the camera moves further away, more pixel quads are only partially discarded (see the debug sketch below). This, at least, explains why moving further away kills performance despite the same or fewer fragments being used.
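
To see this directly, something like the following untested sketch (reusing the height and sliceLayer values from the shader above) can temporarily replace the lighting output; it highlights the quads that are only partially discarded:

// Hypothetical debug view, not part of my actual shader. fwidth()
// measures how much a value changes across a 2x2 pixel quad, so it is
// non-zero exactly where a quad is partially discarded (the expensive
// case). The discard itself must be skipped so mixed quads stay visible.
float mask = step(height, sliceLayer); // 1.0 where the fragment would be discarded
gl_FragColor = vec4(fwidth(mask), 0.0, 0.0, 1.0);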

Additionally, I got much better performance after enabling mipmapping on the displacement texture, which I believe made the discarding less erratic at far viewing distances.
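
No shader change was needed for that; once the texture has a mipmapped min filter (e.g. Trilinear in jME), the existing lookup picks a coarser, pre-filtered level based on the screen-space derivatives. If it were still too noisy, my understanding is the lookup could also be biased toward coarser levels; the bias value below is just a made-up starting point:

// The regular lookup mip-filters automatically once mipmapping is enabled:
float height = texture2D(m_DisplacementMap, texCoord).r;
// Optional: bias toward coarser mip levels to smooth the discard pattern further.
float smoothedHeight = texture2D(m_DisplacementMap, texCoord, 1.0).r;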

I hadn’t thought of that. I’m not changing any material parameters after I initially create the material, and I’m not adding/removing lights either, so I’d expect no recompiles. Just in case, I checked the material’s sort id, and it only changes once over the course of the program.

It is rather cool, imo. I’m obviously still working out kinks in the method, but the idea is to adapt a fur technique to generate real geometric detail on meshes… and without using a gazillion vertices in the process.

I’ll have to upload a video at some point… an image doesn’t fully capture the effect.

Yeah, I played with something like this some time back, too.

How are you doing the multiple quads?

For my case, I was taking one mesh and setting it up like it was using instancing… but where the buffers were just repeated for every instance… then using the instance ID to project the vertices out along the normal (roughly like the sketch below). So technically it was still one draw call, and the effect could essentially be turned on/off by hitting one buffer on the mesh.
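
The vertex shader side was roughly like this (from memory, so treat it as a sketch; I’m borrowing your m_NumSlices/m_StackHeight names for the uniforms):

// Sketch of an instanced shell vertex shader; gl_InstanceID needs
// GLSL 1.40+ or the draw_instanced extensions.
attribute vec3 inPosition;
attribute vec3 inNormal;

uniform mat4 g_WorldViewProjectionMatrix;
uniform int m_NumSlices;
uniform float m_StackHeight;

varying float sliceLayer;

void main() {
    // 0.0 at the base surface, 1.0 at the outermost layer
    sliceLayer = float(gl_InstanceID) / float(m_NumSlices - 1);
    vec3 pos = inPosition + inNormal * sliceLayer * m_StackHeight;
    gl_Position = g_WorldViewProjectionMatrix * vec4(pos, 1.0);
}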

I can’t remember if using discard affects the early fragment depth tests. I would say it has to, or the driver has to make sure that each layer is completely rendered before moving to the next (in terms of the actual shader hardware on the GPU).
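
From what I understand, a discard in the shader normally does disable the early depth/stencil write, since depth can’t be committed until the shader decides whether the fragment survives. Newer GLSL can force the early tests back on, though with a caveat that probably matters here:

// GLSL 4.20+ (or ARB_shader_image_load_store): force early depth/stencil
// tests despite the discard. Caveat: depth/stencil is then written before
// the shader runs, so even discarded fragments write depth, which is
// likely wrong for a layered effect like this one.
layout(early_fragment_tests) in;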