Suggestions for v3.4

Hello fellow monkeys,

During the past month I have spent quite some time looking into where OpenGL has gone since jMonkeyEngine was released, and found that before doing a Vulkan renderer there is a bunch of things that could be added to introduce new features and reduce the driver overhead that comes with the current approach.
For example, there were discussions about ComputeShaders, so I looked into them. What I found, unfortunately, was that implementing them required reworking the current buffers (VertexBuffer, UniformBuffer, ShaderStorageBuffer).
Semi-demotivated, I looked around the internet for other features that more modern OpenGL versions allow, and I have to admit I was overwhelmed by what you can do.
For example, you can tell OpenGL to count the number of samples that pass depth and stencil tests and make that number available to a shader without ever reading it back to the CPU (e.g. for dynamic tessellation or texture LOD).
Or you can generate DrawCommands on the GPU via ComputeShaders and draw them all at once with a single draw call from the CPU.
Or you could even combine those two examples and do full culling on the GPU.
Or you can get a pointer to GPU memory and write data to it directly, instead of storing data in CPU memory only to copy it all to the GPU later (which most probably causes the driver to make one more copy if the GPU is still working on the old data), to stream data to the GPU more efficiently (more particles, for example).
Or you can use a buffer as the source for VertexAttributes (like inPosition) while updating its values in a ComputeShader, to do animations purely on the GPU.
So please, can you add that to the engine? Thanks a lot!

Jokes aside: I forked the engine, and what I ended up with is more than 11 000 additions and 300 deletions, so I'm a little afraid it won't make it into the core, but here we go:

(NOTE: I have changed the readme of my fork to contain the same explanation):

ComputeShaders:

To make use of them, you first have to create a ComputeShaderFactory:

ComputeShaderFactory factory = ComputeShaderFactory.create(renderer);

Then you can instantiate ComputeShaders using a .comp file or a String as source:

ComputeShader someShader    = factory.createComputeShader(assetManager, "Shaders/SomeShader.comp", "GLSL430");
ComputeShader anotherShader = factory.createComputeShader(SOURCE_STRING, "GLSL430"); 

Uniforms, Buffers and Textures as well as Images can be bound to the ComputeShader:

someShader.setVector3("Direction", Vector3f.UNIT_Y);
someShader.setFloat("Intensity", 2f);
someShader.setTexture("Tex", assetManager.loadTexture("path/to/tex.png"));

Because there is no j3md file or similar for a ComputeShader, uniforms cannot be bound to defines and thus defines have to be set manually:

someShader.setDefine("SIZE", VarType.Float, 1f);
someShader.setDefine("SAMPELS", VarType.Int, 32);

This, however, allows for Strings as defines too, for example:

someShader.setDefine("READ_CHANNEL", null, "x");
//to use in glsl shader code like:
float val = imageLoad(m_Image, x).READ_CHANNEL;

When using the setTexture() method, the provided texture will be bound as a texture and can be accessed in the shader using samplers (same as in fragment shaders). This can make use of LODs (although textureLod() needs to be used, as only fragment shaders can do automatic texture LOD calculation), MinFilter and MagFilter, and no Format has to be specified explicitly as the sampler handles all of that. The downside is that you can only read from and not write to the texture this way.
When using setImage(), however, the provided texture will be bound as an image and can be accessed via imageLoad() and imageStore(). Only one mipmap level can be bound and the format has to be stated explicitly (and far fewer formats are supported), but it enables writes to the data. A setImage call can look like this:

int mipmapLevel = 0;       //bind max size mipmap level
int layers = -1;           //-1 means bind all layers (in case of TextureArray, TextureCubeMap or Texture3D)
boolean useDefines = true; //sets 3 defines in the shader: *NAME*_WIDTH, *NAME*_HEIGHT and *NAME*_FORMAT
someShader.setImage("Input", VarType.Texture2D, tex, Access.ReadOnly, mipmapLevel, layers, useDefines);

Buffers:

All buffers have static factory methods, for example:

UniformBuffer ubo            = UniformBuffer.createNewEmpty();
AtomicCounterBuffer acbo     = AtomicCounterBuffer.createWithInitialValues(0, 0);
QueryBuffer qbo              = QueryBuffer.createWithSize(16, true);
ShaderStorageBuffer ssbo     = ShaderStorageBuffer.createNewEmpty();
DrawIndirectBuffer dribo     = DrawIndirectBuffer.createWithCommands(commands);
DispatchIndirectBuffer diibo = DispatchIndirectBuffer.createWithCommand(command);

From an OpenGL point of view, different types of buffers don't exist. There are only buffers (which are just blocks of memory) and buffer targets, which are basically views on that memory. The intention behind the buffer rework was to allow for the same possibilities OpenGL allows, i.e. having different views on the same memory. What this basically means is that you can allocate some memory, create a VertexBuffer view and a ShaderStorageBuffer view on that same memory, and use the VertexBuffer as the source for vertex positions while binding the ShaderStorageBuffer to a ComputeShader to update the values before they are fetched in the vertex processing stage during rendering.
Since the buffer rework not only reworked the existing types but also added new ones (like DrawIndirectBuffer or DispatchIndirectBuffer), DrawCommands can be created on the GPU as well (using a ShaderStorageBuffer view and a DrawIndirectBuffer view on the same memory), and QueryObject results can be made available to shaders without reading them back to the CPU (create a QueryBuffer view and a ShaderStorageBuffer view on the same memory, tell OpenGL to store the query result in the query buffer, and read it in the ComputeShader or FragmentShader via the ShaderStorageBuffer).

To create several views on the same memory, the buffer representing that memory, called UntypedBuffer, has to be created first. During creation, three decisions have to be made:

  • DirectMode or LazyMode

    LazyMode is the engine's default behaviour, i.e. data is only sent to the GPU once it is actually needed. This is usually fine but has some drawbacks: you cannot decide when the data is actually sent (which might cause stalls if too much data gets sent at once when it is needed) and you cannot upload data in chunks. DirectMode means that by the time a method called on the UntypedBuffer returns, the GL calls reflecting those changes have already been issued (of course that doesn't mean those calls have already changed the state on the GPU).
  • GpuOnly or CpuGpu

    Specifies the memory usage of this UntypedBuffer. GpuOnly allocates memory on the GPU only, while CpuGpu will keep the data stored on the CPU as well.
  • BufferData or StorageAPI

    BufferData is the default usage. It is supported by all platforms and specifies how the data will be used (static, dynamic or streamed). StorageAPI is not supported on old hardware but is needed when you want to map the buffer.

An example of how to create a GPU-only UntypedBuffer in DirectMode with BufferData usage StaticDraw and two views on it:

UntypedBuffer buffer = UntypedBuffer.createNewBufferDataDirect(MemoryMode.GpuOnly, renderer, BufferDataUsage.StaticDraw);
buffer.initialize(someSize);
ShaderStorageBuffer ssbo = buffer.asShaderStorageBuffer(null);
VertexBuffer posBuffer = buffer.asVertexBuffer(Type.Position, Format.Float, 3, 12, 0); //3 floats, the stride between the attributes is 12 bytes and their offset is 0 bytes

Then the ShaderStorageBuffer ssbo can be bound to a ComputeShader to write data to it, while the VertexBuffer can be set on a Mesh to use the data as inPosition in the VertexShader (you have to take the SSBO's layout into account).
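
As a minimal sketch of that setup, reusing the buffer and the two views created above (the shader file "Shaders/UpdatePositions.comp" and the buffer name "Positions" are made up for illustration, and the actual dispatch of the compute shader is omitted):

//create the compute shader that will write the vertex positions (hypothetical .comp file)
ComputeShader positionUpdater = factory.createComputeShader(assetManager, "Shaders/UpdatePositions.comp", "GLSL430");
//bind the ShaderStorageBuffer view so the compute shader can write into the shared memory
positionUpdater.setShaderStorageBuffer("Positions", ssbo);

//set the VertexBuffer view on a mesh so the same memory is read as inPosition while rendering
Mesh mesh = new Mesh();
mesh.setBuffer(posBuffer);
mesh.updateCounts();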

PersistentlyMappedBuffers

Mapping a buffer basically means getting a pointer to the GPU memory and writing to it directly, instead of writing data somewhere on the CPU just to copy it all to the GPU later. The downside is that you need to get that pointer from OpenGL, and in most modern implementations this requires a client-server sync (all prior GL calls have to make it to the driver thread so it can give you the requested pointer, even for unsynchronized mapping). Since any kind of sync is slow, persistently mapped buffers were introduced, which are mapped once and then kept mapped.
A RingBuffer interface has been added, with the SingleBufferRingBuffer and MultiBufferRingBuffer implementations, that uses persistently mapped buffers to stream data to the GPU while providing synchronization and fencing out of the box, to make sure you don't write data to a block of the buffer that the GPU is currently reading from.

An example of how to use PersistentlyMappedBuffers:

//do once during initialization
int numBlocks = 3;
int bytesPerBlock = 1024;
RingBuffer buffer = new SingleBufferRingBuffer(RENDER_MANAGER.getRenderer(), bytesPerBlock, numBlocks);
            
//each frame grab the next block (potentially has to wait in case the block is still used by the GPU)
RingBufferBlock block = buffer.next();
//and write the data (up to 'bytesPerBlock' bytes)
block.putInt(123);

//now use that data for rendering or for a compute shader
...

//after the calls have been made that use the data on the GPU, mark the block finished
//this means that when buffer.next() would return this block again, it is made sure that all GPU commands issued prior to calling finish() on this block have completed
block.finish();

SyncObjects

SyncObjects are basically markers that can be placed in the GPU queue so their state, either signaled or unsignaled, can be queried later. This means you can do some GL calls and place a SyncObject; if a later check of its state returns signaled, you know that all GPU work queued prior to placing the SyncObject has finished. If it returns unsignaled, you know the GPU is still busy processing the queued GL commands.

An example of how to use SyncObjects:

//do some GPU calls
...
            
//then place sync
syncObj = new SyncObject();
renderer.placeSyncObject(syncObj);
            
//the next frame or some other time later check the state
Signal signal = renderer.checkSyncObject(syncObj);
// if signal is Signal.AlreadySignaled or Signal.ConditionSatisfied, all prior work has finished
// else (signal is Signal.WaitFailed or Signal.TimeoutExpired), the GPU is still busy doing the work

QueryObjects

QueryObjects can be used to query the GPU for some specified values, for example the number of samples that passed depth and stencil tests when rendering a mesh, or the amount of time that passed on the GPU between starting and finishing the query.

//create query
GpuQuery query = new GpuQuery(GpuQuery.Type.SAMPLES_PASSED, renderer);
query.startQuery();
            
//now render some geometries
...
            
//stop the query
query.stopQuery();
            
//either: during next frame or some time later query the result
long samplesPassed = query.getResult();
//or : directly after stopping the query, store the result in a QueryBuffer
query.storeResult(buffer, offset, use64Bits, wait);

CameraUniformBuffer

One usage of UniformBuffers has been built into the engine: a uniform buffer that contains all camera-related data of the scene. It has the following layout:

layout (shared) uniform g_Camera {  
    mat4    cam_viewMatrix;
    mat4    cam_projectionMatrix;
    mat4    cam_viewProjectionMatrix;
    mat4    cam_viewMatrixInverse;
    mat4    cam_projectionMatrixInverse;
    mat4    cam_viewProjectionMatrixInverse;
    
    vec4    cam_rotation;
    
    vec3    cam_position;
    float   cam_height;
    vec3    cam_direction;
    float   cam_width;
    vec3    cam_left;
    float   cam_frustumTop;
    vec3    cam_up;
    float   cam_frustumBottom;
    
    float   cam_frustumLeft;
    float   cam_frustumRight;
    float   cam_frustumNear;
    float   cam_frustumFar;
    float   cam_viewPortLeft;
    float   cam_viewPortRight;
    float   cam_viewPortTop;
    float   cam_viewPortBottom; 
    
    float   cam_time;
    float   cam_tpf;  
};

And a new Material WorldParameter has been added: CameraBuffer. This means all camera-related data has to be sent to the GPU only once at the beginning of the frame, and not once for each geometry that requires it. This does not include WorldMatrix or WorldViewMatrix etc., of course, as those are different for each geometry. For a working example see jmonkeyengine/TestCameraUniformBuffer.java at master · AKasigkeit/jmonkeyengine · GitHub and the Material at jmonkeyengine/jme3-examples/src/main/resources/jme3test/ubo at master · AKasigkeit/jmonkeyengine · GitHub

Other Changes

Several things have been changed under the hood that don't directly expose new features, most notably:

  • VAOs (VertexArrayObjects) have been added. Basically, before rendering a mesh, all VertexBuffers (and, if used, the IndexBuffer) have to be bound and the VertexAttributePointers have to be updated. While those calls might not be as heavyweight as draw calls, for example, they still require native calls and sending the calls from the client thread to the server thread, where they even require some checks to make sure they are valid. While this state changes between different geometries (they use different buffers for their vertex data, at least with the usual setup), it usually doesn't change for the same geometry between two frames (it still uses the same vertex buffers next frame). Thus this state can be stored and reused the next time the mesh is rendered. This requires OpenGL 3.
  • Texture and buffer bindings have been changed. When setting textures or buffers, the renderer reuses binding units in case the requested texture / buffer is already bound, and during initialization the OpenGL implementation is queried for its specific maximum number of binding points for the specified type / target, so more textures / buffers can be bound at the same time. (This is not noticeable to the user, but it increases the chances that a texture / buffer is already bound when it is needed; for that reason I also recommend using TextureArrays whenever possible to pack several textures into one texture object, as sketched below.)
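
A minimal sketch of that TextureArray packing, using the standard jME API (all source images must share the same size and format; the asset paths and the material parameter name "DiffuseArray" are placeholders):

//collect the individual images (they must all have the same dimensions and format)
List<Image> images = new ArrayList<>();
images.add(assetManager.loadTexture("Textures/grass.png").getImage());
images.add(assetManager.loadTexture("Textures/rock.png").getImage());
images.add(assetManager.loadTexture("Textures/sand.png").getImage());

//pack them into a single TextureArray so one binding unit serves all layers
TextureArray textureArray = new TextureArray(images);
textureArray.setMinFilter(Texture.MinFilter.BilinearNoMipMaps);
textureArray.setMagFilter(Texture.MagFilter.Bilinear);
material.setTexture("DiffuseArray", textureArray);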

Capabilities

New Capabilities like Caps.ComputeShader, Caps.BufferStorage, Caps.MultiDrawIndirect, Caps.ImageLoadStore, Caps.QueryBuffer or Caps.SyncObjects have been added to check what the hardware the program currently runs on supports.
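
A quick sketch of such a check, using the capability set the renderer already exposes (the fallback branch is just illustrative):

//query the capabilities reported by the current OpenGL context
EnumSet<Caps> caps = renderer.getCaps();
if (caps.contains(Caps.ComputeShader) && caps.contains(Caps.BufferStorage)) {
    //safe to create the ComputeShaderFactory and use persistently mapped buffers
    ComputeShaderFactory factory = ComputeShaderFactory.create(renderer);
    //... set up the compute-based path
} else {
    //... fall back to a CPU-side implementation on older hardware
}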

Examples

Several examples were added under jmonkeyengine/jme3-examples/src/main/java/jme3test/buffers at master · AKasigkeit/jmonkeyengine · GitHub and jmonkeyengine/jme3-examples/src/main/java/jme3test/compute at master · AKasigkeit/jmonkeyengine · GitHub


So to sum it up, this is all quite developer-level and not like “throw 500 geometries at me and I handle everything from LOD over streaming the data to rendering it”, but I hope it is something worth discussing.


Thanks for contributing :slight_smile:

There is a lot to read, but even I, without big experience in the low-level OpenGL API, understand much of it.

I only wonder if it might break some current rendering for some people or not.
Anyway, like you said, it will require discussion between more knowledgeable low-level API monkeys here.

Very cool. I will have to review in more depth later. Definitely need someone more OpenGL savvy than me, but this looks very useful.

Thanks for your interest!

First of all, in case anyone is willing to dig really deep into the topic (now that I have seen/read through those resources, I can only recommend doing so, although it probably turned me more into an engine developer than an actual game developer, so be careful), here are some resources to start with:

This one is from GDC 2014, and they underline that the functionality (btw shown on a Win7 system) was already available on hardware that was modern back then. It is basically the first video to watch to understand AZDO (approaching zero driver overhead).

This one goes into a little more detail about multi draw (although only at the beginning; it later switches to NVIDIA-only extensions and Vulkan).

While this one provides more information about buffer mapping.

Some additional links about modern rendering approaches.

And a high-level article including sources for GPU-based particle systems:

Now back to the code, and how the changes might affect existing projects:

Since, as mentioned, there are 11 000 additions and only 300 deletions, most things are added features that I'm quite sure don't break any project (like the compute shaders, SyncObjects, etc.).

Then there are changes like the buffers, where I tried to keep everything working the way it worked, while adding missing features or providing alternatives.
So for example the VertexBuffer class has been changed, but in a way that is meant to be fully compatible:
you can still create the VertexBuffer the exact same way you did before (meaning allocate a buffer on the CPU, let jME send the data to the GPU once it is needed and keep the data in both places). Additionally, you can create a VertexBuffer by first creating an UntypedBuffer and then using the untypedBuffer.asVertexBuffer(type, etc, etc) method to create a VertexBuffer that is a view on the UntypedBuffer (which you can create more views of), and once you have that VertexBuffer you set it on a mesh the same way you did before.
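
To illustrate, here is a rough sketch of the two paths next to each other (the classic path uses the standard jME API, the view-based path reuses the UntypedBuffer calls shown above; vertexCount and the mesh are assumed to exist):

//classic path: allocate a CPU-side buffer and let the engine upload it lazily
FloatBuffer positions = BufferUtils.createFloatBuffer(vertexCount * 3);
VertexBuffer classicVb = new VertexBuffer(Type.Position);
classicVb.setupData(Usage.Static, 3, Format.Float, positions);
mesh.setBuffer(classicVb);

//view-based path: the VertexBuffer is just a view on an UntypedBuffer
UntypedBuffer untyped = UntypedBuffer.createNewBufferDataDirect(MemoryMode.GpuOnly, renderer, BufferDataUsage.StaticDraw);
untyped.initialize(vertexCount * 3 * 4); //3 floats of 4 bytes per vertex
VertexBuffer viewVb = untyped.asVertexBuffer(Type.Position, Format.Float, 3, 12, 0);
mesh.setBuffer(viewVb); //use one path or the other, not both on the same mesh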

And lastly there are changes that definitely affect all projects, like the introduction of VAOs: if the renderer detects that the OpenGL implementation supports VertexArrayObjects (everything OpenGL 3+), then it uses them instead of rebinding the buffers and specifying the attribute pointers each frame. While I have of course tested it while forcing the renderer to believe my system doesn't support VertexArrayObjects, that still doesn't guarantee there is no bug.

Additionally there are changes that are not likely to break any projects, like the CameraUniformBuffer, because they are not fully integrated into the core of the engine. By that I mean the Unshaded material, for example, has not been changed to make use of it, and thus projects using the Unshaded material are not affected by this.

As mentioned, I added some Capabilities so you can always check if features are available on the hardware the program is running on, which the renderer itself uses, for example for the VAOs.

Some words about what I have already done with this fork, to show that some things have been used quite a lot already:

  • following this OpenGL FFT Ocean Water Tutorial #1 | IFFT Equation - YouTube, I implemented an OceanRenderingSystem using IFFT in ComputeShaders to calculate displacement and normal maps, which are used by the tessellation and fragment shaders to offset the vertices and do proper lighting, respectively

  • following this http://developer.download.nvidia.com/books/gpu_gems_3/samples/gems3_ch30.pdf, I implemented a 3D FluidSimulator and a corresponding raymarcher via ComputeShaders

  • implemented an OcclusionOctree that I used for my spatials instead of the usual Node-Geometry hierarchy. It heavily used GpuQueries and traversed the octree during rendering, taking previous query results into account if present, potentially skipping huge nodes of the octree. I used it in my marching cubes project, where the performance improvement was incredible (I could pause it and fly around rendering only the geometries that were visible from the position when paused, and basically everything in caves or behind mountains and hills was culled)

What I am currently working on:

  • an MdiSystem (MultiDrawIndirect) that you can just throw geometries at and it will manage culling and LOD on its own. It uses persistently mapped buffers to stream transformation data (like the world matrix), does frustum culling or, if available, HiZ culling on the GPU, calculates the appropriate LOD on the fly, and builds a list of draw commands for the final geometries with the corresponding LOD, to be rendered with a single draw command from the CPU side.
    (Thus the little hint with “not like ‘throw 500 geometries at me and I handle everything from LOD over streaming the data to rendering it’”: I'm working towards it! :smiley: )

And a slight note aside: like 99% of the public methods I added are documented (don't know what made me go that crazy), and I kept it up to date with the master branch. So if you're unfortunately stuck in quarantine or have the rest of the day free for other reasons, and after watching the resources linked at the top thought about a new super-efficient way of rendering millions of particles, or always wanted to speed up your marching cubes generation using compute shaders (and have the joy of debugging your spaghetti mesh instead of spaghetti code, Coding Adventure: Marching Cubes - YouTube ), clone the repo and jump right into the adventure.

NOTE: I am definitely not trying to get anyone to switch to my fork. I just feel it would be a huge help for the code if people played around with it and got familiar with it, providing feedback or maybe bug reports that I could fix, because I'm afraid nobody is super happy about the idea of reading 11 000 lines of additions.


Out of the tech talk (since I'm not a specialist here):

Give the core devs time first to read and analyze it all (it might take time, I believe). Is there some pull request, btw?

The work you did sounds really nice to me, even if it's still OpenGL (since Mac dropped support for it).
So again, thank you for giving the engine a chance to grow :slight_smile:

99% of public methods documented? Sounds great :+1:

Below is just my personal opinion:

Assuming it will go into core (if the core devs approve), you would need to prepare some “higher API tools/classes and tests” that could just be used by “default” people, like this MdiSystem you talk about or an “UnshadedNew”/“LightingPBRNew” (since we need to know the bugs, right?).

For example, I myself could maybe use something like this when writing my own new shaders, etc.
There are some shader specialists, but most people would like to use ready-to-use things (and these people provide the real bug-hunting).

That's where the real bug-hunting would appear. It would mean a lot of work for you, but since it looks like you are willing to maintain the 11k lines of code, it looks fine.

I understand it would help debug code (that you use yourself, as I understand), but it would also mean that you would be a core dev, with responsibility for maintaining this code. I think this maintenance declaration from your side is needed, since probably no one will just jump in and maintain this immediately.


Not only a note for the OP but for everyone: if you plan to make such big changes, just get in touch with the devs first, so individual points can be discussed before there is an implementation. That way you do less duplicate work/refactoring.
I understand that people might just want to implement it in the way that works best for their use case, or are on a coding spree, but if you're worried about not getting it into the core, that might be a sign to get in touch :slight_smile:

Your changes come at the right time, actually, as we planned on reworking the rendering for 3.4.
This however also means we might need to slow down integration to forge them into the new API/Code in the best possible way.

ComputeShaders

Was there a specific reason not to use the existing MatDef/Material system?

CameraUniformBuffer

I guess we might have to assert here that rendering code won't change the camera values.
Also I guess this automatically works in multi-camera scenarios, because those are just different viewports being rendered sequentially?

This is something @RiccardoBlb also worked on afaict.

Good for us, I guess :stuck_out_tongue:

I'm afraid we have to read all those lines and/or at least open a hub topic for each feature to discuss the implementation/potential API. But that's a good thing.

As I already outlined, we are working on a “ScriptablePipeline”.
Think of it like using a Builder Pattern/Java 8 Streams to describe the rendering process.
In the current codebase, it is a bit tedious to create a camera rendering the scene offscreen and then use the result as a texture for a "CRT" shader to be displayed on a monitor.
Everyone can however imagine how some pseudocode for such a pipeline would look like.

There we can benefit greatly from GPU-only buffers (e.g. we've been discussing a point where each pipeline step had to move the data back and forth between CPU and GPU), and we can also look into parallelization via dependency analysis (but this might remain an experimental feature and is mostly the pipe-dream target).

Currently however, we’re still two projects behind, which is my fault for not pushing it properly, but I’ll try my best to keep up. For instance the next project would be the “spring cleaning”, where we try to spot bugs and clean up the codebase where possible, so we have a solid base before continuing with new features.

The other big thing we’ve been thinking about is a 2020 Showcase of the Engine, to show what we’re capable of. This serves multiple things: Debugging/Seeing if everything is working for everyone, as a benchmark and also as a tutorial/example code for such scenes, besides all the PR effects etc.
Thus it would also be a great addition to have a Version 2 with the new Pipeline and your features to see how much we can improve code verbosity and speed.

Talking about speed I have one question, though: Do you have a good use case where the driver overhead is visible? Because most monkeys so far could even get away with quite a few drawcalls.


Thanks for your feedback!

You are definitely right, it needs some higher-level systems (like the examples of particle systems or GPU culling I mentioned earlier), but as you said, I am working on it.
Those features would not necessarily need to be in core (to me it seems like people tend to share additions in the jMonkey store nowadays for fast releases, higher variety without encumbering the core, etc.), while the features currently added definitely had to change the engine and could not be put into a separate project (of course it could be a separate project if you used reflection, for example, to get the render context of the renderer and did your own Android / desktop detection for proper GL calls, but I guess that's not the way to go).
Then again, there are cases where, for example, you're working on an AI system that you want to improve, and then ComputeShaders and lower-level buffer management might be all you need, and that system wouldn't be a candidate for core because it's too specific. It's the same with tessellation shaders: there is no usage of them in the engine, yet I'm super happy they have been added because they can still be of good use to other game developers.

Yes, I am definitely willing to maintain this code. It is not like I threw the code together in the last 3 weeks and am now about to forget the details again already :smiley: . Instead, I started working on it a year ago and have used and extended it with every project I did (thus there is a bunch of small changes that I had to make during one project or another, like Texture.WrapMode.BorderClamp and a Texture.setBorderColor(ColorRGBA) to set the color that is returned by the sampler if the UVs are out of bounds, or the stencilMask and stencilRef in the RenderState that allow for more usages of the stencil buffer, which turned out really useful for deferred rendering, for example), so I am not going to stop working on and with that code anytime soon anyway.

Well, ComputeShaders don't fit the standard rendering pipeline, so there was no point for me in adding them to the MatDefs. From a code perspective, however, I reused as much as possible (like Shader and ShaderSource of course, and the renderer's capability of updateUniform, updateBuffer etc.).
ComputeShaders are more like AppStates and not like Controls: you might want to use them for more global stuff instead of running them for each Geometry in your scene, as the sketch below tries to illustrate.
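
As a rough sketch of that idea, an AppState owning one global ComputeShader could look like this (the shader path, the uniform name and the commented-out run(...) dispatch call are assumptions for illustration; the real dispatch API may differ):

public class GlobalSimulationState extends BaseAppState {

    private ComputeShader simulation;

    @Override
    protected void initialize(Application app) {
        //one factory / shader for the whole scene, not one per Geometry
        ComputeShaderFactory factory = ComputeShaderFactory.create(app.getRenderManager().getRenderer());
        simulation = factory.createComputeShader(app.getAssetManager(), "Shaders/Simulation.comp", "GLSL430");
    }

    @Override
    public void update(float tpf) {
        simulation.setFloat("TimeStep", tpf);
        //hypothetical dispatch, once per frame for the whole scene:
        //simulation.run(64, 1, 1);
    }

    @Override protected void cleanup(Application app) { }
    @Override protected void onEnable() { }
    @Override protected void onDisable() { }
}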

When viewports are rendered, the RenderManager calls its setCamera method for the camera of the viewport to render, which calls its internal setViewProjection() method, which updates the state of the UniformBindingManager. The UniformBindingManager grabs the CameraUniformBuffer related to that camera, or creates one in case there is none, updates its values and keeps a reference to that CameraUniformBuffer. From then on, whenever a shader has its uniform bindings updated via UniformBindingManager.updateUniformBindings() and the CameraBuffer was declared in the MatDef WorldParameters, this CameraUniformBuffer is bound to the shader.
So that means: if you have a SceneProcessor attached to a specific viewport, change the camera to render to another FrameBuffer, render some Geometries and then switch back to the first Camera to continue rendering the other stuff, it will work as expected. However, if you change the values of the first Camera before setting it again on the RenderManager, then its values will not get updated. In other words: you can change cameras as much as you want, but you cannot change the values of the same camera after the first geometry using this camera has been rendered. You can just create another Camera in that case, but before you end up with 50 cameras, I wonder why one would change the camera values 50 times per frame after rendering has started. Then again, of course there might be an obvious use case that I just never came across.

Please don't; even the comment on the current query object implementation for the DetailedProfilerState claiming the result is guaranteed to be available the next frame is not correct, unfortunately.

So out of curiosity, is that code somewhere I can see it? Because, as stated, my features are more like a toolset, and especially while the rendering pipeline is currently being reworked, I should stop working on the MdiSystem, because this one is the high-level thing to be used by the end user.

Can you elaborate a little on this, please?

Well, I guess this is a larger topic :smiley:
First, we run on Java, thus we pay the additional overhead of a native call per GL call; this one is right on our main thread. Then, on modern driver implementations there is a client-server model, so the next thing that happens is that the call gets put into a buffer, also on our main thread. That's it on our main thread; that's all we see in the left column of the DetailedProfilerState.
Then, every now and then when there are enough calls in the buffer on the client thread, the buffer gets kicked to the server thread, which starts processing it, doing some checks to see if each call is valid (for example when binding buffers, those checks are typically delayed until an actual draw call is made, which in our case happens a couple of calls later, or it has to compile a shader because some define changed) and puts it into another buffer. This happens on the server thread; while still on the CPU, it is not the main thread of our application and thus not visible in the profiler.
Finally, when the server thread has also built up enough calls, they get flushed to the GPU to be actually processed.
Now the right column of the DetailedProfilerState uses GPU queries to measure the time that passed on the GPU between the start and the end of a step. On the bucket level, for example, that is the time between right before the first geometry of the bucket was set up for rendering and the time the step finishes, which is once the next step starts, thus after all geometries of the bucket have been rendered (and even including some depth changes for the next bucket, for example. EDIT: I'm wrong on that one, the depth values are changed right after the profiler is told about the new step, not before it; the rest is valid though).
So the problem with this is that the actual driver overhead is not directly visible to us, although it kind of shows up in the right column as an increased amount of GPU time, because the GPU query does not only count the time when the GPU is busy, but the time between the start and the end of the query. If the GPU was not busy because the driver had to do so many validations for every single object and could not send actual draw commands fast enough, then the GPU is rendering some draw commands, waiting, rendering more, waiting, and all of that counts towards the GPU time.
Of course, if you're fragment-shader bound or vertex-shader bound, then this also shows up in the right column, thus it's hard to tell when you're actually driver-overhead bound.


@Samwise - wow! This is great to see!

I guess I don’t see the need for this - this is a very powerful toolset for graphics/shader devs, but by nature it’s not something that beginners would make use of unless they were using a built-in shader that used it without even knowing it. The folks who would be looking for the features this offers are the same folks who won’t likely be using the default materials and who won’t be scared away by a GPU-level API.

In any case, I’m thrilled to see progress towards modern OpenGL features. It’s possible that this AZDO work will also be beneficial for possible future Vulkan rendering systems as well.


@danielp what I mean is that we need "more testers" to have "correct bug-reporting". Each piece of hardware is different and each person uses different engine settings.

And this can only be achieved by providing, for example, a "built-in shader" like you said :slight_smile:
All these "things" don't need to be in core, they can just be in the JME Store, but they need to be provided.

Gotcha, that makes sense. :smiley:

I have a wide range of GPU hardware and would be willing to run tests if they are written.


Thanks a lot, that sounds really awesome!

I have just pushed another test to https://github.com/AKasigkeit/jmonkeyengine/blob/master/jme3-examples/src/main/java/jme3test/buffers/TestMultiDrawIndirect.java that uses multi draw indirect to render a bunch of quads (16000) with a single draw command and uses persistently mapped buffers to stream the translation of the quads each frame to move them independently. (Sounds like instancing, but it doesn't have to be all quads; each could be a totally different shape.)

Another test at https://github.com/AKasigkeit/jmonkeyengine/blob/master/jme3-examples/src/main/java/jme3test/buffers/TestShaderStorageBuffer.java does ray marching via a compute shader on a bunch of spheres stored in a shader storage buffer, which should get laid out automatically although it uses "layout shared" (you cannot move the camera, it's just a test case :smiley: )

And a last one worth mentioning is https://github.com/AKasigkeit/jmonkeyengine/blob/master/jme3-examples/src/main/java/jme3test/buffers/TestQueryBuffer.java, which uses GPU queries to color cubes darker when they cover fewer fragments, without reading the query result back to the CPU
(it does this in a quite inefficient way because it creates one material per geometry, but it's just a test case for the GPU queries, the QueryBuffer, and how to make them available to the shader via a shader storage buffer).

I will come back to you tomorrow (it's quite late here) to see how I can provide better test cases.


Maybe you are interested in jme3-testing, our approach to testing rendering, i.e. we can automatically detect if the render fails or looks different on a different device.

So far it only runs on OpenGL 2.1 on GitHub Actions, because that's the only version the software renderer supports, but the strictness of Mesa is a good thing for a few of our shaders. Besides, when run locally, it supports whatever your host supports.

That sounds interesting. As I was refactoring some code over the weekend, I did not yet have time to look into it much, unfortunately.

That is a little unfortunate though, because none of the features work with OpenGL 2.1 except for the GpuQueries, and even then only querying for the samples passed. I will however prepare 2 or 3 test cases for that test library that use different sets of the features so they can be run locally; just give me a couple more days, please.

In the meantime I have been refactoring some parts and fixed the layout detection for nested GLSL structs, which didn't work before. I have added 3 more test cases:
one at https://github.com/AKasigkeit/jmonkeyengine/blob/master/jme3-examples/src/main/java/jme3test/buffers/TestShaderStorageBufferNested.java that shows the layout detection for nested structs:
basically you can declare structs and a buffer in your compute shader like
GLSL code:

struct Color {
     float values[3];
};

struct Sphere {
     Color color;
     vec3 pos;
     float radius;
};

layout (shared) buffer m_Spheres {
     vec3    ambientColor;
     Sphere  spheres[];
};

and your Java-side code:

private static class Color implements Struct {
    @Member private float[] values; 
}

private static class Sphere implements Struct {
    @Member private Color color;
    @Member private Vector3f pos;
    @Member private Float radius;
}

and send it to the GPU using:

Sphere[] spheres = ... //some array of spheres you created

ShaderStorageBuffer ssbo = ShaderStorageBuffer.createNewAutolayout();
ssbo.registerStruct(Color.class);
ssbo.registerStruct(Sphere.class);
ssbo.setField("spheres", spheres);

compShader.setShaderStorageBuffer("Spheres", ssbo);

and another test at https://github.com/AKasigkeit/jmonkeyengine/blob/master/jme3-examples/src/main/java/jme3test/buffers/TestBlockLayout.java that shows how to get the layout information of a shader storage block, to lay out the vertex attributes accordingly when using compute shaders to update them

and the last one at https://github.com/AKasigkeit/jmonkeyengine/blob/master/jme3-examples/src/main/java/jme3test/compute/TestCombined.java that combines some of the features in a totally pointless way:
It uses a single quad with a bunch of per-instance position and scale attributes laid out according to the layout of the struct in the ComputeShader that is used to update those attributes. Then it uses a DrawIndirectBuffer to draw all of those 1 million quads with one draw call using MultiDrawIndirect, and prints the number of samples that passed the depth test to the screen using GpuQueries. The per-instance attributes are initialized on the CPU (with a custom "Instance" class that implements "Struct") and sent to the GPU with the automatic layout detection. The ComputeShader that updates the translation and scale of the quads each frame is dispatched using a DispatchIndirectBuffer to source the work group count from the GPU.

I have also added Caps checks before the shader block layouts are queried, and delayed querying for the layout until a buffer is actually used with the shader.


Yeah, we are also unhappy about this limitation, but there isn't much we can change. I think they are working on OpenGL 3 support, but we can always run it locally, at least.

I note that pretty much every piece of code that you’ve referenced so far has been committed to your master branch. This makes me just a little worried about the practicality of getting your changes reviewed for inclusion in the official JME builds. Especially since you’ve already said that you’ve written ~11kloc. That is a huge amount of code to review, and the difficulty/headspace needed to do a proper code review seems to roughly scale exponentially with the amount of code that needs to be reviewed.

I’d recommend that any changes that you are contemplating that you would like to contribute back to the engine should be developed in their own branches, based on the current engine HEAD. The ‘nested glsl’ changes, for example, sound like they are just about right to stand on their own as a Pull Request.

It might, in the interests of expediting discussion and review, also be useful to break up that 11kloc change into several separate branches. (though I’d appreciate feedback from the engine team.) The headings in your original post seem like good points to start.

11k is a lot, but it's also a big feature. (The question is whether it would have been possible to write it in far fewer lines of code.)

About OpenGL 2.1, well, these features could just be provided when a higher API version is used.

First of all, thanks for your input!
The changes (and by that I mean the 11k lines) can basically be broken up into GL calls (adding methods to the GL, GL2 etc. interfaces, adding implementations to the LwjglGL etc. classes), comments and empty lines (it's still source code), which together take up a reasonable amount already (a single getter method with a proper comment is 10 lines already), and then there is the "actual code" of course, which is far less than 11k lines.
Another huge part of the code is the several classes for the different buffer views, namely AtomicCounterBuffer, DispatchIndirectBuffer, DrawIndirectBuffer, ParameterBuffer, QueryBuffer, ShaderStorageBuffer and UniformBuffer, which basically all just put data into a buffer, however in specialized ways, so this makes for quite some lines of code too.

The GL calls are also a problem when separating it into several branches, because they are so fundamental; some of them are shared by a few features. On a higher level, the RingBuffer implementations for example also require the SyncObjects and the buffer changes. Or the GpuQueries are needed by the QueryBuffer, which is part of the buffer changes, so there is a bunch of dependencies.
I have, however, tried to keep it up to date with the current master branch of jME, so a compare can be seen here:
Comparing jMonkeyEngine:master...AKasigkeit:master · jMonkeyEngine/jmonkeyengine · GitHub (it's currently not compatible because of a commit from a few days ago, but I will fix that in the next days), and you can still click "files changed" to see the differences.

Maybe we should first discuss the features that it would offer, perhaps taking my test cases as a reference to get a feeling for them, then see how we would like to have them in jME, and then see whether my implementation fits that or not, so I'm up for any further input. Maybe people have implemented some of the features in their own forks already and can give feedback on how they did it or what they used it for, to make sure the final implementation covers all cases.


Sailsman is right. In general, the good thing is that git supports extracting the history based on a folder/file. That's another thing to consider: we cannot just merge everything at once but will do so on a per-feature basis.
But since the state is how it is, maybe the said extract-folder approach might be worth it, either now or in the end. The trouble starts when those branches depend on each other, etc.


@Darkchaos what are you using for software GL? llvmpipe has up to OpenGL 4.5 support now, but a lot of it came in the last year. Maybe your image uses an older OS with an older Mesa.