(August 2016) Monthly WIP screenshot thread

I see :slight_smile: Actually, I think texture fetches kill the GPU even better.

I think that the light vector has 4 elements, because most matrices are 4x4. Not sure though…

Yes I’ve read that texture fetches cost more than a bunch of math because you have to access texture memory.

1 Like

So I should call the GPU a bad dog because it won’t fetch fast enough?

But seriously, remember that nebula dissolve shader I posted here last time that worked fine?

When I tested it out on my laptop with a 960M, it tanked the fps from 60 to 40 when close up. It makes only one texture fetch at the start, but apparently running this for every pixel:

	color.rgb *= multColor;

	float avg = normSum+color.r+color.g+color.b;
	float spd = 0.8;

	#ifdef SPEED
		spd = m_Speed;
	#endif

	color.a *= 0.55+sin(g_Time*spd+avg*5.0)*0.45+0.1;

is enough to completely destroy the framerate? I moved as much math as possible to the vert shader but it still wasn’t enough.

I ended up putting a checkbox in the options to disable the animation if needed. :confused:


This reminds me that I should test the new planet shader there too, to see if it explodes or something before proceeding.

1 Like

I’m not sure what you mean by making only one texture fetch at startup while something else runs for every pixel. Is the texture fetch in the vertex shader? If so, it runs once per vertex, which is still a lot better than once per pixel, but if it’s in the fragment shader then that fetch happens for every pixel.

And if the texture fetch is in the vertex shader then your sine calculation could also be put into the vertex shader.

No, I mean it still uses a call to texture2D() in the frag shader, as you obviously can’t render a texture otherwise. I meant that I only call it once, at the beginning of main(), just like in Unshaded.j3md for example.

What I put into the vert shader is a dot product calculation and the adding up of the normal vector (normSum).
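For anyone following along, moving per-fragment math into the vertex shader usually looks something like this. A sketch only: the jME built-ins (g_WorldViewProjectionMatrix, inPosition, inNormal) are real, but the normSum varying is just illustrative, not the actual shader:

```glsl
// Vertex shader sketch: compute once per vertex and let the
// rasterizer interpolate, instead of recomputing per fragment.
uniform mat4 g_WorldViewProjectionMatrix;
attribute vec3 inPosition;
attribute vec3 inNormal;

varying float normSum;

void main(){
    // sum of the normal components, done per vertex;
    // the frag shader just reads the interpolated normSum
    normSum = inNormal.x + inNormal.y + inNormal.z;
    gl_Position = g_WorldViewProjectionMatrix * vec4(inPosition, 1.0);
}
```

This works because interpolation is linear: interpolating the sum gives the same result as summing the interpolated normal, so nothing is lost by hoisting it.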

Out of curiosity, have you tried running the shader without the sine calculation, just to see if it makes a difference? As you get closer, the geometry takes up more of the screen, so the fragment shader is executed more and more.

Okay, it seems I tested something wrong. As I was testing the planet shader just now on the laptop, it appears the benefit of turning off animated nebulas is… -2 fps.

Yes, it actually seems faster with the animation enabled now.

I have been proven an idiot yet again. By myself.

The problem must be something else then, most likely related to the large amount of transparent pixels on the quads I’d say.

1 Like

Are you discarding the 0 alpha pixels?

Trig functions can be expensive per pixel… but I guess they are optimized pretty well in modern GPUs. Like sqrt(), I generally try to avoid them but partially because I’ve been coding long enough to remember when add was faster than multiply.
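On the sqrt() point: the classic way to avoid it is comparing squared lengths, since the root never needs to be taken for a comparison. A sketch with made-up names:

```glsl
// true if p is within radius of center, without calling sqrt():
// length(p - center) < radius  is equivalent to
// dot(d, d) < radius * radius   (both sides squared)
bool insideCircle(vec2 p, vec2 center, float radius){
    vec2 d = p - center;
    return dot(d, d) < radius * radius;
}
```

Whether this still wins on a modern GPU is debatable, but it costs nothing to write it this way.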

I was gonna say I bet transparency plays a role here, transparency covering large portions of my screen kills my laptop. At the same time I’m using a Radeon 4100 Mobility which isn’t exactly top of the line.

Nonetheless you’re obviously not an idiot, that game you’re working on is clearly very complicated and well done.

Are you using alpha discard or just letting them all render?

I ask because I actually want to know and not to steer you towards one or the other. Using discard at all in a shader can have performance implications but so can rendering a bunch of transparent pixels.

I use discard yes, but I wasn’t talking about shaders I’ve made, transparency covering a decent amount of my screen in commercial games kills my laptop.

Yep, I am. Last time I checked it made close to no difference (I even tried thresholds higher than 0.0f, but that just looked ugly while still being slow).

I just had a hunch that it might be the bloom filter, so I threw the nebula geometries into the translucent bucket. Made only like 5 fps difference if any at all. Wat.

I see the #ifdef SPEED in there. Out of curiosity will that #ifdef statement change from true to false and back again from frame to frame? If so whenever it changes it forces a recompilation of the shader.

Nope, it’s just set when the nebula is created (the only thing that ever changes during runtime is the animation on/off boolean when you press the checkbox in the settings). I left it like that so there’s a default speed in case the define isn’t set for some reason.

Right on, just making sure :wink:

Also good to note, for those who don’t know: GPUs are what’s known as Single Instruction, Multiple Data, or SIMD. Every pixel is calculated using the same instruction stream with different data, such as UV coordinates. This means things get tricky when you have an if statement (a regular if, not #ifdef). If all pixels take the same path it’s not a problem, but if even one pixel deviates then BOTH sides of the if statement are processed and the correct result is picked once the shader finishes. This is because we’re using SIMD: every pixel must be run through the same instructions.

In other words for the most part an if statement in a shader will not make things more efficient by skipping over code that need not be run for a particular pixel, it’s still executed.
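When both paths are cheap, one common way around divergence is to compute both and select with the built-ins mix() and step() instead of branching. A sketch, not from the shader above:

```glsl
// branching version: may cause divergence within a SIMD block
vec3 branchy(vec3 a, vec3 b, float t, float threshold){
    if(t < threshold){
        return a;
    }
    return b;
}

// branchless version: both inputs are still evaluated, but every
// fragment runs exactly the same instructions.
// step(threshold, t) returns 0.0 when t < threshold, else 1.0,
// so mix() selects a below the threshold and b at or above it.
vec3 branchless(vec3 a, vec3 b, float t, float threshold){
    return mix(a, b, step(threshold, t));
}
```

This only pays off when evaluating both inputs is cheap; if one path does something expensive like extra texture fetches, a real branch can still be the better choice.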

Yeah, the non-preprocessor if statements are supposed to be really expensive, which is why this is hilarious:

#if defined(DISCARD_ALPHA)
    if(color.a < m_AlphaDiscardThreshold){
        discard;
    }
#endif
Like literally, doesn’t that do more harm than good?

Not all if statements are bad, which is why you can use them. That particular if statement is not a bad idea. While it is true that both sides are evaluated when executing the shader, it does prevent the GPU from performing blending on that pixel based on the alpha value after the shader is executed.

1 Like

My understanding is that this is GPU dependent. Older GPUs definitely did not support branching and so had to do as you say… but modern GPUs actually do support some branching and so it can be fine.

Also note that the parallelization is at the fragment level and not the pixel level. A given pixel may have many fragments written.

…which is why I asked about discard. Normally, before the fragment is processed, the depth is checked to see if the GPU even needs to process that fragment. (JME renders the opaque bucket front-to-back for this reason.) However, fragment shaders that use discard or write to depth cause this optimization to be skipped, because the GPU can no longer predict what depth will be written by the pixels it happens to be writing at that time.

At least that’s my understanding. Mostly I just try stuff and put lots of #ifdefs for turning things on and off.

1 Like

I tend to use #ifdef myself where possible. And it’s not poor practice to use an if statement in a shader; usually you need one when it determines the outcome of the pixel. But an if statement that exists only to save on processing time may very well end up costing even more.

GPUs don’t process all pixels in parallel; usually they process blocks of 4x4 or 8x8 pixels using SIMD. Modern GPUs have gotten better at predicting which path the pixels in a particular block will take based on the paths pixels in previous blocks took, so optimizations are made, but they’re not always correct, especially for shaders where the path changes often from pixel to pixel.

These optimizations do speed things up compared to previous implementations, but they don’t eliminate the penalty. Oftentimes you end up with a 4x4 block that was predicted to take a particular path, but in reality three of those pixels take one path and the fourth takes another, so both paths must be processed. Even with the optimization, one or two deviating pixels in a block force both paths to run.

I can only speak to my own experience doing the trilinear interpolation stuff in my terrain shader that merges textures based on surface normal. Wrapping different sets of texture lookups in if blocks definitely improved performance on my nvidia cards.

That kind of branching used to cost a lot more.