GPU Particles

Hi guys, here’s the technical explanation of my GPU particles demo. I’m assuming you know at least a little bit about OpenCL, so I won’t be going into how to write an OpenCL kernel or any of that.

My GPU particle implementation and tests are part of my VFX library, if you feel brave enough to give it a try. I’m still finalizing the API for it.

General Idea

An OpenCL kernel (similar in role to a compute shader) is used to calculate the position, color, etc. of every particle, since the GPU excels at parallel operations. The calculated data is written to buffers or images inside the kernel, which the vertex shader can then read to display each particle correctly.

The CPU is kept out of the loop as much as possible, since it can easily become an unwanted bottleneck, though it is still responsible for managing objects and launching the kernels.


There are four steps to running GPU particles (and GPU simulations in general):

  1. Initialize OpenGL resources. This includes creating textures and vertex buffers.
  2. Initialize OpenCL resources. Each buffer and texture you want to control via OpenCL and simultaneously allow OpenGL to use must be bound to a corresponding OpenCL resource. The creation and binding of OpenCL resources occurs on this step.
  3. Run an OpenCL kernel to setup particle data.
  4. Run another OpenCL kernel that updates particle data. Repeat this once every frame.

It is important to allow for at least one render between steps 1 and 2, because OpenCL cannot bind to resources that are not uploaded to the GPU (that happens during render). I typically wait two frames before binding resources.
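
One way to handle that delay is a simple frame counter in the update loop, so binding is only attempted after a couple of renders have completed. This is just a sketch of the idea (the class and its names are mine, not part of the library):

```java
// Defers OpenCL resource binding until the OpenGL resources have been
// rendered (and therefore uploaded to the GPU) at least `delay` times.
public class DeferredBinding {
    private final int delay;
    private int framesRendered = 0;
    private boolean bound = false;

    public DeferredBinding(int delay) {
        this.delay = delay;
    }

    /** Call once per frame; returns true on the frame binding should happen. */
    public boolean update() {
        if (bound) {
            return false;
        }
        framesRendered++;
        if (framesRendered >= delay) {
            bound = true;
            return true; // safe to perform step 2 (bind OpenCL resources) now
        }
        return false;
    }
}
```

With `new DeferredBinding(2)`, `update()` returns true exactly once, on the second frame, matching the two-frame wait described above.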

Storing Data

There are two methods for storing data between OpenCL calls and between OpenCL and OpenGL. One (obviously) is vertex buffers, which are typically easy to handle but are relatively slow. The other method is images, which can sometimes be a royal pain to handle, but provide much much (much) better performance. I used images in my particle demo.

Java Example using Buffers

First, OpenCL must be initialized and a program and kernels created.

clContext = context.getOpenCLContext();
clQueue = clContext.createQueue().register();
Program program = clContext.createProgramFromSourceFiles(
        assetManager, "Shaders/");
Kernel initKernel = program.createKernel("initParticleData").register();
Kernel updateKernel = program.createKernel("updateParticleData").register();

And enable OpenCL support in the app settings (before starting the app, ofc).
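
In jME that’s a flag on AppSettings. To my knowledge the setter is called `setOpenCLSupport`; treat the exact name as an assumption if your engine version differs:

```java
AppSettings settings = new AppSettings(true);
settings.setOpenCLSupport(true); // request an OpenCL-capable context
app.setSettings(settings);
app.start();
```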


Also, don’t forget to set the mesh mode to points (like I did :tired_face:), or else no particles will show up.
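
Setting the mesh mode is a one-liner (shown here as a sketch; `mesh` is the same mesh the position buffer is attached to):

```java
mesh.setMode(Mesh.Mode.Points); // draw one point per vertex/particle
mesh.updateCounts();            // refresh vertex counts after buffer changes
```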


Step 1:

// set up position buffer
FloatBuffer pb = BufferUtils.createVector3Buffer(numberOfParticles);
VertexBuffer buf = mesh.getBuffer(Type.Position);
if (buf != null) {
    buf.updateData(pb);
} else {
    buf = new VertexBuffer(Type.Position);
    buf.setupData(Usage.Static, 3, Format.Float, pb);
    mesh.setBuffer(buf);
}

Step 2:

// bind OpenCL buffer to the position buffer
Buffer clPosBuf = clContext.bindVertexBuffer(
        mesh.getBuffer(Type.Position), MemoryAccess.READ_WRITE);

Step 3:

// run initialization kernel
initKernel.Run1NoEvent(clQueue, new Kernel.WorkSize(numberOfParticles), clPosBuf);
// Note: since the buffer will not be read back on the CPU, we don't need an event

Step 4:

// run update kernel
updateKernel.Run1NoEvent(clQueue, new Kernel.WorkSize(numberOfParticles), clPosBuf);
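
On platforms without implicit GL/CL synchronization, a shared buffer generally has to be acquired before the kernel runs and released afterwards, every frame. In jME’s OpenCL wrapper I believe the calls look roughly like this (method names are from memory — verify them against your engine version):

```java
// per-frame update, assuming clPosBuf is shared with OpenGL
clPosBuf.acquireBufferForSharingNoEvent(clQueue);
updateKernel.Run1NoEvent(clQueue, new Kernel.WorkSize(numberOfParticles), clPosBuf);
clPosBuf.releaseBufferForSharingNoEvent(clQueue);
```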

Pretty simple. Of course, the main logic behind the particles lives in the OpenCL program; the Java here is only supposed to manage and support it. You can check out a sample OpenCL program here.

Taking Advantage of Images

With buffers, I can achieve around 50,000 particles before it gets slow. In order to support millions of particles, images should be used to store the particle data instead of buffers. The tradeoff to using images is that OpenCL is unable to read and write to the same image during the same call.
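
As a back-of-the-envelope illustration, storing one particle per texel means choosing texture dimensions that cover the particle count; a near-square texture is a common choice. This helper is purely illustrative (not part of the library), and real code may also round up to a power of two:

```java
// Computes the size of a near-square 2D texture that can hold
// `numParticles` texels, one texel per particle.
public class ParticleTextureSize {
    public static int[] sizeFor(int numParticles) {
        int width = (int) Math.ceil(Math.sqrt(numParticles));
        int height = (int) Math.ceil(numParticles / (double) width);
        return new int[] { width, height };
    }
}
```

With a layout like this, particle `id` lives at texel `(id % width, id / width)`, which both the kernel and the vertex shader can compute.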

To get around this, I’m employing a technique I believe is called “ping-ponging”: a kernel reads from image1 and writes to image2 on even frames, then reads from image2 and writes to image1 on odd frames, so it “ping-pongs” back and forth. Of course, this doesn’t have to be done for every image; only the ones that change based on their current state.
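
The bookkeeping boils down to swapping a read target and a write target once per frame. A minimal sketch of the idea (this is my illustration, not the helper class mentioned below):

```java
// Tracks which of two resources is currently the read target and which
// is the write target, swapping them every frame ("ping-ponging").
public class PingPong<T> {
    private T a, b;

    public PingPong(T first, T second) {
        this.a = first;
        this.b = second;
    }

    public T readTarget()  { return a; }
    public T writeTarget() { return b; }

    /** Call once per frame, after the update kernel has run. */
    public void swap() {
        T tmp = a;
        a = b;
        b = tmp;
    }
}
```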

If you want to see how ping-ponging works in code, I’ve developed a class to help manage ping-ponged images that you can go over.

Also, the vertex shader must be modified to read positions from a texture instead of a buffer, and to account for ping-ponging. Here’s an example of a vertex shader that does that.

Caution: Icy Road

I’ve had small mistakes here and there freeze up my application, so I recommend unlocking your cursor when working with OpenCL so you can terminate the application via the SDK if that happens. I’ve had to restart my PC many times because a frozen application had locked up my cursor. :confounded:


Very cool.
I’ve tested the texture-based GPU particles on my system and they work flawlessly.
I wonder if the issue with the buffer is that it is initialized in RAM and then uploaded to the GPU. It is possible to initialize the buffer directly in VRAM, and this should make this approach even faster than image sampling. It should be possible from jme as well by passing null as data; if not, we can patch the core to support this use case, I suppose.
Since sampling textures from the vertex buffer shader requires “vertex texture fetch” support on the gpu drivers.


Do you mean “vertex shader” instead of “vertex buffer” above?

I do this all the time in many (most) of my shaders. Do we know what cards these days do not support this?

Yes, typo, I meant vertex shader.

I don’t have any reliable list, but I suppose all modern AMD and NVIDIA drivers support it on GLES/GL 3+.
My doubts are regarding the usual suspects: open-source drivers, older Intel drivers, and Android.

I happen to have a recent Linux laptop with the worst open-source Intel driver support I’ve ever seen; if you have a jar that I can run there (without OpenCL or other fancy stuff), I will see what happens.

Also, it is unclear how min/mag filtering is handled in the vertex shader.

If you still have one of the ancient versions of Mythruna from 10+ years ago, even that used texture fetch from the vertex shader to sample noise for wind. So if the tall grass waves in the breeze then texture fetch is working.

New Mythruna uses it for all of the far terrain elevation as all of the terrain mesh data is constant.

I used to worry about support way back in 2011 when I first started using it but in my experience, on particularly potato-quality cards, other GPU limitations seemed to fail before it ever came up.


Yes, it would be great if buffer performance could be improved. Even better if they could be more performant than textures.

What do you mean by passing null as data?

In OpenGL it is possible to pass nullptr as the data in glBufferData to allocate an empty buffer on the GPU.
That said, I’ve double-checked the jme source code and it seems we enforce a null check. That makes sense, since most of the engine probably expects buffer data to be non-null. I will think about whether there is a graceful way to support GPU-only buffers (maybe adding a GpuOnly usage together with a single-value ByteBuffer containing the expected GPU length as its first value?)
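
For reference, in raw OpenGL (here via LWJGL, as a sketch) the size-only overload of glBufferData does exactly this — it maps to `glBufferData(target, size, NULL, usage)` in C and allocates uninitialized storage on the GPU:

```java
import static org.lwjgl.opengl.GL15.*;

// Allocate uninitialized GPU storage for the particle positions:
// 3 floats per particle, no CPU-side data is uploaded.
int vbo = glGenBuffers();
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, numberOfParticles * 3L * Float.BYTES, GL_STATIC_DRAW);
```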
