Fast dynamic meshes

Hi, I need to update vertex buffers as efficiently as possible. Does anyone know of some good example code to look at?



I can also describe the meshes, if someone has got other advice to give.



They aren’t very large, just about 150 verts or so, and each vert has 3 floats for pos, 3 for texcoord1 (volumetric coords), 2 for texcoord2 (flat noise coords), and 1 for texcoord3 (opacity). All in all it’s about 1400 floats.



The vertex count never changes, nor do the indices.



They’re not streaming buffers, just regular dynamic ones. The meshes are only updated if they are within the camera frustum (a check is made), so an update may happen every frame, but it also may not.
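For context, here is a simplified sketch of the kind of per-frame update I mean, using plain jME3 calls (the class name and the computePositions helper are just stand-ins; the texcoord buffers would follow the same pattern):

```java
import com.jme3.scene.Mesh;
import com.jme3.scene.VertexBuffer.Type;
import com.jme3.scene.VertexBuffer.Usage;
import com.jme3.util.BufferUtils;
import java.nio.FloatBuffer;

// Simplified sketch: rewrite the position buffer of an existing mesh whenever
// the mesh passes the frustum check. The texcoord buffers work the same way.
public class DynamicMeshUpdater {

    private final Mesh mesh;
    private final FloatBuffer positions; // reused every update, never reallocated

    public DynamicMeshUpdater(Mesh mesh, int vertexCount) {
        this.mesh = mesh;
        this.positions = BufferUtils.createFloatBuffer(vertexCount * 3);
        // Hint that this buffer is rewritten often.
        mesh.getBuffer(Type.Position).setUsage(Usage.Dynamic);
    }

    public void update(float time) {
        positions.clear();                 // position = 0, limit = capacity
        computePositions(time, positions); // fill x, y, z for every vertex
        positions.flip();
        // Hand the same FloatBuffer back; jME marks the VBO dirty and re-uploads it.
        mesh.getBuffer(Type.Position).updateData(positions);
        mesh.updateBound();
    }

    private void computePositions(float time, FloatBuffer out) {
        // stand-in for the actual per-vertex work
    }
}
```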



Thanks //Androlo

This might be overkill, but what about OpenCL/OpenGL context sharing? If you move the vertex manipulation code to OpenCL kernels you save the memory-transfer round trips and can manipulate the VBOs directly in GPU memory. You don’t need to add new external library dependencies, as LWJGL supports OpenCL 1.1, but you must tweak the jME internal code to obtain the GL context.



I made some OpenCL tests using LWJGL and it seems to work fine, but I did not try the context-sharing part, so I cannot comment on it.



Nvidia and AMD have good demonstration/sample code…



But again, for 1400 floats it is very questionable; simply sending the data every frame will probably suffice.



If you want to investigate this approach further, Google will give you tons of info; here is something to start with:

http://www.dyn-lab.com/articles/cl-gl.html
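To make the kernel side a bit more concrete, here is a purely hypothetical sketch (the kernel name and the interleaved 9-floats-per-vertex layout are just assumptions based on the mesh described above). The source string is what you would feed to clCreateProgramWithSource; the host would wrap the VBO with clCreateFromGLBuffer, acquire it with clEnqueueAcquireGLObjects before launching the kernel, and release it afterwards, so the vertex data never makes a round trip back to the CPU:

```java
// Hypothetical example: OpenCL C source for a kernel that displaces vertices
// in-place inside a GL/CL shared vertex buffer. Layout assumption: 9 floats
// per vertex (3 pos, 3 texcoord1, 2 texcoord2, 1 texcoord3), position first.
public final class VertexUpdateKernelSource {

    public static final String SOURCE =
          "__kernel void updateVerts(__global float* verts, const float time) {\n"
        + "    int i = get_global_id(0);   // one work-item per vertex\n"
        + "    int base = i * 9;           // stride of the interleaved layout\n"
        + "    // toy displacement along y, just to show writing into the shared VBO\n"
        + "    verts[base + 1] += 0.01f * sin(time + verts[base + 0]);\n"
        + "}\n";

    private VertexUpdateKernelSource() {}
}
```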



Regards,

Remorhaz


The BatchNode or the SkeletonControl does that in the core, if you want an example.


@nehon

@remorhaz

Great!!! I was hoping for exactly this kind of stuff. Thanks.

@remorhaz

GL/CL interop is clearly the way to go. 1400 isn’t much, but there are 12 of those meshes (at most; there is a culling system) and the update process itself is heavy, typical GPU stuff. Updating the verts is in fact more of a bottleneck than updating the buffers.



Also with more efficient updates it would be possible to use more verts, which would reduce some of the artifacts that the original system suffers from.



EDIT: And it would also pave the way for more volume rendering stuff.

@androlo



I am glad you found it useful. There are some cons, though: Intel did not have an OpenCL driver for their GPUs until the very latest Ivy Bridge parts. This means the solution will not work on all hardware that supports OpenGL. But if you target mid- and upper-range hardware it will not be a problem.



I found a complete simple example that might be helpful:

http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=99&Itemid=150

The older Intel cards will not be able to run these kinds of simulations anyway (because they suck), so it doesn’t really matter.

This is from a program using OpenGL/CL interop with LWJGL. It generates the texture using a CL kernel and then uses OpenGL to display it.



I now have a framework for creating an OpenCL system using interop, and a “jME-style” wrapper class for CL programs (it reads .cl kernel files, kind of like how shaders work). There’s a pretty decent error-handling system built in as well: it checks the platform (whether it’s 64/32-bit etc.) and configures the system based on that. I created it all from just a few example files, so it can be made a whole lot better.
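The file-reading part of such a wrapper is tiny; something along these lines (a sketch with an invented class name, assuming the .cl files are bundled on the classpath), with the returned string going straight into clCreateProgramWithSource:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch of a tiny ".cl as resource" loader, analogous to how shader sources
// are read from the assets.
public final class ClSourceLoader {

    public static String load(String resourceName) throws IOException {
        InputStream in = ClSourceLoader.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("OpenCL source not found: " + resourceName);
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    private ClSourceLoader() {}
}
```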



Time to start seeing how it can be mixed into the other code. Feeling very optimistic about this.



http://www.jamtlandoutdoors.se/images/dist/mandelbrot.png


Cool, we were talking about doing something like this for the animation system as well. Do you use the native OpenCL API directly? Did you check out Aparapi?

Aparapi is very cool and makes it possible to run the code even on Intel again, because if no OpenCL is available it just falls back smoothly to Java Thread Pools… Using JTPs is especially nice for testing Java code that may throw exceptions (like ArrayIndexOutOfBoundsException :roll:), because you have the power of e.g. the Eclipse debugger to step into it.
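As a rough illustration (the class is invented, the interleaved layout is just carried over from the earlier sketches, and the API shown is the com.amd.aparapi one), an Aparapi kernel is plain Java, so the same class runs via OpenCL on the GPU or falls back to a thread pool:

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

// Sketch: the same toy vertex displacement as a hand-written CL kernel, but as
// an Aparapi kernel. If no usable OpenCL runtime is found, Aparapi executes
// run() in a Java thread pool (JTP) instead; same code, debuggable in the IDE.
public class VertexUpdateAparapi extends Kernel {

    private final float[] verts; // interleaved: 9 floats per vertex (assumption)
    private float time;

    public VertexUpdateAparapi(float[] verts) {
        this.verts = verts;
    }

    @Override
    public void run() {
        int i = getGlobalId();   // one work-item per vertex
        int base = i * 9;
        verts[base + 1] += 0.01f * time; // toy displacement along y
    }

    public void update(float time, int vertexCount) {
        this.time = time;
        execute(Range.create(vertexCount)); // GPU via OpenCL, or JTP fallback
    }
}
```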

@normen said:
Cool, we were talking about doing something like this for the animation system as well. Do you use the native OpenCL API directly? Did you check out Aparapi?

No, he uses LWJGL; it wraps OpenCL just like OpenGL. That's the beauty of it: you don't need to add a new jar to the stack.

@normen

That image was created using an unmodified LWJGL test class from their tests jar. I built some classes and stuff based on that and a few other examples I found. It is not using the jME GLContext specifically; that’s the part I’m gonna try to do next (and probably fail at for quite some time). Gonna try it anyway. It will be interesting to explore the inner workings of jME more, if nothing else.



EDIT: I suppose the reason you ask about Aparapi is maybe because in LWJGL they tend to create GL contexts from CL contexts, and not the other way around? Maybe Aparapi is better and more practical. Gonna check it out.

I was more talking about wrapping the actual OpenCL part a bit more, for the reasons @cmur2 put up, and to make it more seamless for users. @sbook played with Aparapi and the jME math system a bit and most stuff transfers nicely, it seems. It’s just pushing the data back and forth, as always, but as indicated we’d have that data in the GPU already; it’s just about transforming it with the animation system. If it was cross-compilable Java code that’d make it much easier.

@normen said: @sbook played with Aparapi and the jME math system a bit and most stuff transfers nicely, it seems.


It didn't take too long to convert a "standard" CL kernel to Aparapi. It basically involved taking the logic out of the kernel code and putting it in a Java class.

Performance seemed just about the same. One thing I liked better about the LWJGL way of doing things is that you get more leverage over where the code executes (CPU, GPU, anything available) while Aparapi seemed to just run it somewhere. The automagical-ness is probably something that can be overcome with enough poking (I only spent about a day really doing much with it).

That said, I picked up some things. Benchmark the hell out of the code just running on the JVM before you start jumping to CL for calculations. I needed tens of thousands of calculations before I could start to get to a point where the cost of sending to and from the CL context was overcome. CL/GL memory sharing is a nice feature, but you're still only modifying what the GPU has in memory. That data still needs to find its way back to main memory so that the engine can still properly do its job.

I'm not sure how memory is handled if the CL context is CPU-based, but it should be significantly faster than sending over the PCIe bus. Of course, if you take advantage of this theoretical advantage you immediately lose the GL/CL memory sharing, which was the nice benefit to start with.
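On the "where does it run" point: the automagic can be reined in a little, since the old Aparapi API lets you request a specific execution mode and then check what it actually used. A sketch, reusing the hypothetical VertexUpdateAparapi class from earlier in the thread:

```java
import com.amd.aparapi.Kernel;

// Requesting a specific Aparapi backend instead of letting it decide.
// GPU and CPU go through OpenCL; JTP is the pure-Java thread pool fallback.
public final class AparapiModeDemo {

    public static void main(String[] args) {
        float[] verts = new float[150 * 9]; // toy data matching the mesh above
        VertexUpdateAparapi kernel = new VertexUpdateAparapi(verts);
        kernel.setExecutionMode(Kernel.EXECUTION_MODE.GPU); // or CPU / JTP
        kernel.update(0.016f, 150);
        // Aparapi may silently fall back, so report what was actually used.
        System.out.println("Executed in mode: " + kernel.getExecutionMode());
    }
}
```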

Thanks everyone. Very good points. I guess there will not be any noticeable boost in performance when updating my example meshes, all things considered.



Maybe it can be good for procedural texture generation, like in that image? I got very inspired when reading about Ebert’s solid spaces and would really like to try some of his systems (one 3D noise-like texture and one 2D). He seems to use that for a lot of different things.