Handful of GPU and Shader Questions

Hello fellow jMonkeys,

since it’s been some time since i wrote my blocky voxel mesh shader, i figured it might be time to look over it again, and i stumbled upon some questions, some of which are directly related to this shader while others are more GPU-related in general.

first question:
i use the ‘Normal’ buffer to store, for each vertex of a face, a number representing one of the 6 directions a face can be facing, like 0 for a face with its normal pointing towards negative x, 1 for a face with its normal pointing towards positive x and so on. (that is to reduce the data that has to be sent to the GPU: i only need 1 byte instead of 3x3x4 = 36 bytes for the 3 vectors tangent, bitangent and normal with 3 floats each, floats being 4 bytes)

now i read that the GPU is fast (no exact quote, but something like) “when executing a program in parallel that is considered ‘one program’ on the GPU”, meaning that when branching needs to be done, one branch might have to execute first before any calculations can be done in the invocations that chose the other branch.
so i changed my vertex shader from

if (inNormal < 0.5) {
      t = vec3(0.0,0.0,1.0);
      b = vec3(0.0,1.0,0.0);
      n = vec3(-1.0,0.0,0.0);
} else if (inNormal < 1.5) {
      ...
} else if...
 
TBN = mat3(t, b, n);

to

//some lookup tables
const vec3 normals[6] = vec3[6](
                            vec3(-1.0, 0.0, 0.0),
                            vec3( 1.0, 0.0, 0.0),
                            ...
const vec3 tangents[6] = vec3[6](
                            vec3( 0.0, 0.0, 1.0),
                            vec3( 0.0, 0.0,-1.0),
                            ...
const vec3 bitangents[6] = vec3[6](
                            vec3( 0.0,-1.0, 0.0),
                            vec3( 0.0,-1.0, 0.0),
                            ...
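// decode the one-hot face id (1, 2, 4, 8, 16 or 32) into six 0/1 weights;
// exactly one of these ends up as 1.0, all the others as 0.0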
float weightZP = step(32.0, inNormal);
float weightZN = step(16.0, inNormal) - weightZP;
float weightYP = step( 8.0, inNormal) - weightZP - weightZN;
float weightYN = step( 4.0, inNormal) - weightZP - weightZN - weightYP;
float weightXP = step( 2.0, inNormal) - weightZP - weightZN - weightYP - weightYN;
float weightXN = step( 1.0, inNormal) - weightZP - weightZN - weightYP - weightYN - weightXP;

vec3 t = vec3(0.0,0.0,0.0);
vec3 b = vec3(0.0,0.0,0.0);
vec3 n = vec3(0.0,0.0,0.0);

t += tangents[0] * weightXN;
b += bitangents[0] * weightXN;
n += normals[0] * weightXN;

t += tangents[1] * weightXP;
b += bitangents[1] * weightXP;
n += normals[1] * weightXP;

t += tangents[2] * weightYN;
b += bitangents[2] * weightYN;
n += normals[2] * weightYN;

t += tangents[3] * weightYP;
b += bitangents[3] * weightYP;
n += normals[3] * weightYP;

t += tangents[4] * weightZN;
b += bitangents[4] * weightZN;
n += normals[4] * weightZN;

t += tangents[5] * weightZP;
b += bitangents[5] * weightZP;
n += normals[5] * weightZP;

TBN = mat3(t, b, n);

and instead of sending 0, 1, 2, 3, 4 or 5 for the faceDirection, i now send 1<<faceDirection, so 1, 2, 4, 8, 16 or 32, as a byte in the Normal buffer.
i did that because now every vertex shader invocation runs the exact same code and no branching needs to be done at all.
and guess what, i got 0 difference in performance. why is that? what is considered efficient code on the GPU?

i always read that code that’s efficient on the CPU is not necessarily efficient on the GPU and vice versa, and now that i thought i understood what that means, i’m confused that i see no difference in performance.
btw there are millions of vertices using this vertex shader in the scene i use to test the performance difference, and using the DetailedProfilerState i see that flushQueue - opaqueBucket takes by far the biggest share of my frame time, while almost all objects in the scene use this shader.
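for completeness, the same lookup could also be written without the weight arithmetic by indexing the constant arrays directly. this is just a minimal sketch and assumes the buffer still carries the plain face id 0 to 5 rather than the shifted 1<<faceDirection value:

// minimal sketch: index the constant arrays directly with the raw face id 0..5
// (rounding guards against float attribute precision)
int faceId = int(inNormal + 0.5);
vec3 t = tangents[faceId];
vec3 b = bitangents[faceId];
vec3 n = normals[faceId];
TBN = mat3(t, b, n);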

second question:
i use the steepParallaxMapping that jME uses, but i copied that function and moved some of the calculations from the fragment shader to the vertex shader, namely:

vec2 vParallaxDirection = normalize(  vViewDir.xy );

// The length of this vector determines the furthest amount of displacement: (Ati's comment)
float fLength         = length( vViewDir );
float fParallaxLength = sqrt( fLength * fLength - vViewDir.z * vViewDir.z ) / vViewDir.z; 
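// (i.e. |vViewDir.xy| / vViewDir.z, the tangent of the angle between the
//  tangent-space view vector and the surface normal)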

// Compute the actual reverse parallax displacement vector: (Ati's comment)
vec2 vParallaxOffsetTS = vParallaxDirection * fParallaxLength;
// Need to scale the amount of displacement to account for different height ranges
// in height maps. This is controlled by an artist-editable parameter: (Ati's comment)              
parallaxScale *= 0.3;
vParallaxOffsetTS *= parallaxScale;
vec3 eyeDir = normalize(vViewDir).xyz;

float nMinSamples = 6.0;
float nMaxSamples = 1000.0 * parallaxScale;
float nNumSamples = mix( nMinSamples, nMaxSamples, 1.0 - eyeDir.z );

if i’m not mistaken that’s basically 3 square roots that are calculated for each fragment, and i was expecting a performance gain from moving that to the vertex shader, but guess what, exactly 0 fps difference again. so how does that make sense?
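roughly what i mean by moving it, as a minimal sketch with placeholder varying names (not the exact code):

// minimal sketch (placeholder varying names): the values computed above stay in the
// vertex shader and only the interpolated results go to the fragment shader

// vertex shader
out vec2 parallaxOffsetTS;    // assign vParallaxOffsetTS from above to this
out float parallaxNumSamples; // assign nNumSamples from above to this

// fragment shader
in vec2 parallaxOffsetTS;
in float parallaxNumSamples;
// ... the steep parallax loop then reads these instead of recomputing them per fragment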

third question:
this one is not shader-specific but still GPU-related.
I got a notebook with a GeForce 840M, and when i run dxdiag i can see 8GB of memory for that card, so i guess it uses my RAM. that makes me wonder: does that mean that, for such setups, it takes no time to send for example a mesh’s buffers to VRAM, because the memory is shared (so the data is already where it has to be to be accessible to the GPU), while on the other hand texture lookups and such take longer, because it takes longer to get data from system RAM than from dedicated VRAM?

last question:
when exactly does a shader need to be recompiled?
i first thought it is whenever a material parameter changes, but from what i understand now, material parameters are sent to the GPU and updated when necessary but are not compiled into the shader, meaning the shader does not have to recompile, but the GPU has to look up the values in VRAM. so does a shader only have to recompile when a material parameter changes that is bound to a define, or are there other cases?
EDIT: so for performance, i should bind all material parameters that i expect to never or only really rarely change to defines and use these defines in the shader instead of looking up the values in VRAM, while those that i expect to change somewhat frequently should stay regular parameters that are looked up in VRAM?
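to make sure i understand the difference, something like this (just an illustrative sketch, the names are made up and not taken from a real matdef):

// bound to a define: the branch is resolved at compile time, so the disabled code
// is not in the compiled shader at all, but changing the parameter forces a recompile
#ifdef STEEP_PARALLAX
    // ... parallax code only exists when the define is set
#endif

// plain uniform: changing the parameter is just a cheap update, no recompile,
// but the shader has to fetch the value at runtime
uniform float m_ParallaxHeight;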

EDIT: no, i don’t have vsync enabled :smiley: and the bottleneck is not on the CPU side; as soon as i toggle parallax mapping off (which changes a define and then does not run any parallax mapping functions), fps rises from 40 to 46

Many thanks in advance and many greetings from the shire also,
samwise


Regarding the branching, GPUs achieve their speedup by doing several things in parallel instead of one thing at a time. When shaders branch, the same core may be running code in both branches. To do this, the core actually executes all branches - but for the shaders where the “if” check was false, the calculation has no effect on their data. This is why you didn’t see any speed difference - you manually unrolled the branches very much like the GPU would have. Modern GPUs are extremely powerful - I’d suggest always writing code that’s clear and makes the most sense to you to save yourself time, and then let the GPU’s optimizers do the hard bits for you (unless you can show that a specific “hot spot” is causing unacceptable performance loss).
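To illustrate (conceptually; what a specific compiler actually emits will vary): a small if/else over cheap values usually ends up as a select, so every invocation effectively pays for both sides and the condition just picks the result:

// conceptual sketch: the branch becomes a select; both values exist, the condition masks
vec3 n = (inNormal < 0.5) ? vec3(-1.0, 0.0, 0.0) : vec3(1.0, 0.0, 0.0);
// which is functionally the same as
vec3 n2 = mix(vec3(-1.0, 0.0, 0.0), vec3(1.0, 0.0, 0.0), step(0.5, inNormal));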

For the fragment shaders, my guess is in both cases there’s enough other stuff going on that your square root doesn’t make any difference even when done per-fragment. For a modern GPU, a square root is no big deal. As long as you get the results you need, I don’t see any reason not to put it in the vertex shader, but it seems for your case one way or the other doesn’t really matter.

I’ll pass on question #3, but regarding your last question, the shader only needs to recompile if the defines change - as long as you pass your data as uniforms or buffers of some sort, it won’t recompile.


As always, thanks for the quick reply!

ok, that explains a lot, although it is sort of a bummer because it means i currently see no way to optimize the shader when the GPU is that good at optimizing my code :smiley:
or well, i can make some of the material parameters a define because they are just settings that never change during runtime, but i’m afraid i won’t notice a speedup there either

or am i actually wrong with the assumption that the shaders mainly influence the time spent in the flushQueue - opaqueBucket step? i guess the number of drawcalls has a huge impact too, but since i’m afraid i cannot reduce it any further, i thought i could speed things up by making the shaders more efficient

Probably the number of draw calls is your biggest bottleneck.

Other things have already been covered.

I don’t know if switching material parameters to defines will matter too much. Where it can make a difference is when you have #ifdef based on those values. Then you eliminate entire sections of the shader… that can make a big difference. (For example, with no alpha and a #ifdef related to alpha discard, you can avoid including the if( blah ) discard; line… which lets the shader run more optimally. My understanding is that having discard in your shader puts you on a different path… a lot of times it’s worth it, but not if the result is predefined.)
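Roughly along these lines (a sketch in the spirit of the stock shaders; the exact define and parameter names may differ):

// with the define unset, neither the branch nor the discard ends up in the compiled
// shader, so the opaque-only material gets the cheaper path
#ifdef DISCARD_ALPHA
    if (color.a < m_AlphaDiscardThreshold) {
        discard;
    }
#endif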

Reducing draw calls is easy, though. Make bigger chunks. :slight_smile:


Well, on the bright side… You don’t need to spend your valuable time optimizing shader code because the driver/GPU is already very good at it. :wink:

Draw calls will dominate your render time like no other. If you had extremely expensive shaders, doing some manual work to clean them up might improve your performance - but in a geometry heavy scene like you have, my bets are on draw calls swamping your render time so heavily that shaders don’t make any measurable impact.

If you have material parameters that rarely or never change, you’re better off putting them in defines - like @pspeed said, it cuts branches (and in his example leaves your shader eligible for early Z discard), and it also saves you from flushing parameters to the GPU that aren’t needed.

@pspeed Also of course thanks for your reply!
checking if i can use some ifdefs dependent on some defines is a nice idea, but i’m afraid i cannot use it for the discard. i’ll see if i can use it in other places though

EDIT: i guess i might be able to use it for the discard after all. i just need to make 2 materials out of this matdef, one with the discard checks turned off via a material-parameter-bound define and one with them turned on. while meshing i just need to keep track of whether there was a block containing transparency, and if so use the material with the check, otherwise the material that has the check turned off. since most chunks don’t contain transparent blocks, maybe this gives some fps improvement.
i have to work tomorrow so i cannot try it out now, but i’ll test it tomorrow evening

and yes, i ended up with 32x32x32 instead of 32x16x32, but i cannot make the chunks larger because then meshing / calculating the collision shape takes too long

just what made me think it’s the shaders instead of the drawcalls (at least to some degree) is that i see the number of objects increase from 1200 to 1750 when i toggle the DetailedProfilerState on, while fps only drops from 40 to 39

i have to mention though that my setup, the notebook with the GeForce 840M, gives me 170 fps for the simple blue cube example in fullscreen at 1080p, so i guess i should not complain about the 40 fps in my scene and should rather get a more representative setup and do tests on that

In regards to object count/draw calls.

A chunk is a column of cells in this example. That’s what I called them.

When I came upon this problem I created each chunk as normal but at the end when all cells are created (and thus the chunk is complete) I merged the meshes so they became one single scene object.

Then when a change occurred I detected which cell changed (simple position bit shifting), only re-generated that cell mesh (set the cell to dirty), re-used the others that were already generated and didn’t change, and then merged again.

Your generation times will remain virtually the same (mesh merging isn’t expensive cpu-wise) and your object count gets reduced by however many vertical cells you have.

Edit: and to be clear, you are using a single material, right? As in one instance of the material and not multiple instances of the same material.

But it can be incredibly expensive memory-wise.

Block-world games almost always run up against memory limits before anything else. So the tactics need to be a little different.

Furthermore, separate meshes mean better culling.

To OP, do you put your transparent and solid stuff into the same mesh? You may want to stop that. Have n Geometries per chunk where ‘n’ is generally 1 opaque mesh and maybe 1 transparent mesh.

(Mythruna actually has a geometry per texture per chunk.)

i actually started off with chunks (what you call cells) with a height of the full map size. however, for culling reasons (and for meshing speed, although your approach would make updating a mesh basically as fast as it is with the subdivided columns, what you call chunks) i decided to divide the map vertically, too

The cave system is more complex, with bigger caves and more passages than in other known blocky voxel games, so there is a point in culling them, and i actually got a performance boost back when i implemented occlusion culling.
(although i have an idea how to hopefully improve the culling efficiency; i’ll test that sometime in the next days / weeks)

@jayfella on to your question: sure, i am using a single instance of that material, but as i mentioned i now plan to use 2 instances of it, one with the discard checks turned on and one with them turned off, as i hope to get a little improvement by saving the discard checks

regarding what @pspeed says, i do put the transparent stuff in the same mesh as the solid stuff, but only because the transparent stuff is either fully opaque (the usual case) or fully transparent (in which case the fragment is discarded and does not write to the depth buffer etc). how could i get a performance boost by separating these meshes?
although i don’t have transparent blocks in every chunk, in the worst case it might double the number of drawcalls

when you tell me Mythruna has one geometry per texture per chunk, how many chunks do you have in view distance?
like, when i’m told drawcalls are most probably my bottleneck, how can i get better fps by using approaches that effectively increase the number of drawcalls needed to render the scene?

Many thanks in advance for any more pointers in the right direction and for helping me out as always,
samwise

Maybe a screenshot with stats enabled would yield some information.

32x32x32 chunks, 5 high with a max view distance of 192. So a 13x13 area of chunks… 13x13x5 = 845 x however many active materials there are in a chunk… probably averages about 6?

But each of those 845 chunks is a node and the geometries are children… so only the ones in view actually matter for draw calls.

@jayfella alright, here is one screenshot with postprocessing effects turned on and one with postprocessing turned off (that is, with minor post processing only) to show that it makes a small difference, while the chunks actually influence the performance much more

and

so the recommended way to reduce the time spent in flushQueue - opaqueBucket is to reduce the number of drawcalls since shader optimizations would probably be too minor?

@pspeed i also use 32x32x32 chunks, and i also use a separate node per chunk, while these nodes only ever contain exactly one geometry for the voxel stuff (and, when turned on, one additional point mesh for the vegetation shader if there are blocks that need vegetation in this chunk)

since for the occlusion culling algorithm it makes sense to do the frustum culling right away too (to reduce the number of chunks to check, instead of considering them all visible and later frustum culling the ones that are not in view), i toggle these nodes’ cull hints manually and don’t need to do a single frustum check for them in the main thread

with a field of view of, let’s say, 90 degrees that should be around 845 * ~6 / 4 = ~1260 geometries?
that is about what i end up with when the water reflection is set to be the sky node only and not reflect any chunks.
it’s just that the wiki states something like “keep the number of objects around 200 to give a responsive feeling” or similar, and i was wondering how i could ever stay below 200 objects? :smiley:

On Android, 200 is probably a nice high target.

On desktop, I get away with thousands pretty easily. Especially if they are well organized.

that’s actually really awesome to hear :smiley:
just to clarify, by “well organized” you mean good for culling and distance ordering?

i guess i should just continue with development and consider it optimized enough for the moment, although that is hard as i love thinking about optimizations


Yes, I meant good for frustum culling. It’s very fast and cheap… the only requirement is a well organized scene graph.


Yes, i noticed that, although it still takes the most time of all checks in the occlusion culling. i could speed it up by first doing a dot product between the vector from the camera to the chunk center and the direction the camera looks in: if it’s greater than 0.75 i consider the chunk inside, if it’s smaller than 0 i consider it outside, and only if it’s between 0 and 0.75 do i run the jME frustum check (which i copied to make sure it uses the same 6 frustum planes throughout one occlusion culling run, even if that run takes longer than one frame for those with high fps). that dot product shortcut only works because i don’t need to check against the far plane, and after the occlusion culling is done i always add the directly surrounding chunks to the set of chunks that are considered visible.

i’ll now first test whether it makes any improvement to use a second instance of the material, with the discard check turned on, for the chunks that contain transparent blocks, while the opaque-only chunks use an instance with this check turned off via a define, and later on i’ll see what i get from separating the transparent blocks from the opaque blocks geometry-wise

regardless of the result, thanks again already!

Serendipitously, after this discussion, I saw on Twitter that someone asked about using "If"s in their shader. One of the responses was this link:

…while I haven’t read it yet, it did look interesting and seems well recommended.

It’s still probably a case where you should set yourself up to enable/disable certain things so that you can test whatever you try in a variety of situations and hardware. For Mythruna, I added lots of such options that users could manually enable/disable from the in-game console. For every case that made lots of sense there was always the one guy who had exact opposite results.


@pspeed thanks a lot for the link, i already read through it on the way back from work as i could not wait.

there is one suggestion that i would be happy to hear opinions on, in case anyone has experience with it. the text says to do calculations in the vertex shader if possible (that is, if the vertex resolution is high enough to still give good results when using interpolated values in the fragments).
that is what i already did by moving the square root calculations to the vertex shader.

in my blocky world, each face is made of 4 vertices that form 2 triangles, since 2 vertices are shared. now that obviously doesn’t allow me to move the per-fragment normals or the resulting specular calculations to the vertices, since it would defeat the “per-fragment” resolution.

now the idea is to use a tessellation shader: tessellate the faces that are very near quite heavily and lower the tessellation levels the further a face is away from the camera (in such a way that a face ends up with as many subdivisions as are needed to make one tessellated triangle about the size of 4 pixels). i hope that would effectively reduce the amount of calculations to 1/4th (at least of those that were previously done in the fragment shader) while still looking almost as detailed as doing the calculations per fragment.
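something along these lines in a tessellation control shader (just a sketch under some assumptions: a m_CameraPosition uniform set from the material, world-space positions coming out of the vertex shader, and a level formula that would still need tuning; neighbouring faces would also need matching outer levels to avoid cracks):

#version 400
layout(vertices = 3) out;

uniform vec3 m_CameraPosition; // assumed uniform, set from the material

void main() {
    // pass the patch's control points through unchanged
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;

    if (gl_InvocationID == 0) {
        // pick the tessellation level from the distance to the camera:
        // heavy subdivision up close, none (level 1.0) in the distance
        vec3 center = (gl_in[0].gl_Position.xyz
                     + gl_in[1].gl_Position.xyz
                     + gl_in[2].gl_Position.xyz) / 3.0; // assumes world-space positions
        float dist  = distance(m_CameraPosition, center);
        float level = clamp(32.0 / max(dist, 1.0), 1.0, 32.0);

        gl_TessLevelOuter[0] = level;
        gl_TessLevelOuter[1] = level;
        gl_TessLevelOuter[2] = level;
        gl_TessLevelInner[0] = level;
    }
}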

it’s just that my view range is somewhat high and i usually get around 3 to 9 million vertices in a scene (so that’s an average of 6 million vertices, which is already 3 times as many as there are fragments to calculate at full HD resolution), so i’m not at all sure that this would effectively reduce the amount of calculations.
although there probably is some overdraw raising the fragment shader invocations to above 2 million, i don’t think each fragment is drawn 3 times on average, which would mean doing the calculations in the fragment shader effectively does fewer of them

Does anyone have experience with tessellation shaders and their performance, or has maybe followed that approach already?
I used a geometry shader for my first grass approach but the performance impact was huge (maybe because of the overdraw with the single grass blades?!), so i wonder if a tessellation shader might be something to consider for vegetation as well as for my voxel-related meshes

Are your cubes not flat? Why wouldn’t the normal be the same across the surface?

well, they are basically cubes, i.e. each face consists of 4 vertices, but they use parallax as well as normal mapping, giving me really detailed surfaces.
i followed an article on learnopengl.com and added a block with the texture they provide for testing, and those brick blocks look exactly like this in my game:
https://learnopengl.com/img/advanced-lighting/parallax_mapping_plane_heightmap.png

and i really really really want to keep that, even the tree trunks and the simple stone blocks look awesomely detailed especially when moving the camera

EDIT: or do you mean it’s impossible to do texture lookups in the tessellation evaluation shader, so i cannot interpolate between them for the fragment shader?