Hi guys, here’s the technical explanation of my GPU particles demo. I’m assuming you know at least a little bit about OpenCL, so I won’t be going into how to write an OpenCL kernel or any of that.
My GPU particle implementation and tests are part of my VFX library, if you feel brave enough to give it a try. I’m still finalizing the API for it.
General Idea
An OpenCL kernel (similar to a compute shader) is used to calculate the position, color, etc., of every particle, since the GPU is good at parallel operations. The calculated data is written to buffers or images inside the kernel, which vertex shaders can then read to correctly display each particle.
The CPU is kept out of the loop as much as possible, since it can easily become an unwanted bottleneck, though it is still responsible for managing objects and dispatching the kernels.
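To make the division of labor concrete: the kernel's job is just per-particle math, run once per particle in parallel. Here is a rough illustration of the kind of step an update kernel performs, written in plain Java rather than OpenCL C (the names and the physics are hypothetical, not the demo's actual kernel):

```java
public class ParticleStep {
    // Advance one particle by dt seconds using simple Euler integration.
    // pos and vel are {x, y, z}; gravity pulls along -y.
    public static void update(float[] pos, float[] vel, float dt) {
        final float GRAVITY = -9.81f;
        vel[1] += GRAVITY * dt;   // apply gravity to velocity
        pos[0] += vel[0] * dt;    // integrate position
        pos[1] += vel[1] * dt;
        pos[2] += vel[2] * dt;
    }
}
```

In the real demo this math lives in the OpenCL kernel, and the "pos" array is a slot in a shared vertex buffer or image rather than a Java array.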
Implementation
There are four steps to running GPU particles (and GPU simulations in general):
1. Initialize OpenGL resources. This includes creating textures and vertex buffers.
2. Initialize OpenCL resources. Each buffer and texture you want to control via OpenCL while still allowing OpenGL to use it must be bound to a corresponding OpenCL resource. The creation and binding of OpenCL resources happens in this step.
3. Run an OpenCL kernel to set up the initial particle data.
4. Run another OpenCL kernel that updates the particle data. Repeat this once every frame.
It is important to allow for at least one render between steps 1 and 2, because OpenCL cannot bind to resources that have not yet been uploaded to the GPU (which happens during render). I typically wait two frames before binding resources.
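One simple way to handle that wait is a small frame counter in your update loop. This is a hypothetical helper, not part of the library; the Runnable would wrap your actual bind calls:

```java
// Runs an action (e.g. binding CL resources) only after the scene has
// rendered a given number of frames. Call update() once per frame.
public class DelayedBind {
    private final int framesToWait;
    private final Runnable action;
    private int frames = 0;
    private boolean done = false;

    public DelayedBind(int framesToWait, Runnable action) {
        this.framesToWait = framesToWait;
        this.action = action;
    }

    /** Returns true once the action has run. */
    public boolean update() {
        if (!done && ++frames > framesToWait) {
            action.run();  // e.g. clContext.bindVertexBuffer(...)
            done = true;
        }
        return done;
    }
}
```

You would construct this with framesToWait = 2 and tick it from your application's update callback until it reports done.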
Storing Data
There are two methods for storing data between OpenCL calls, and between OpenCL and OpenGL. One (obviously) is vertex buffers, which are typically easy to handle but relatively slow. The other is images, which can sometimes be a royal pain to handle but provide much better performance. I used images in my particle demo.
Java Example using Buffers
First, OpenCL must be initialized, and a program and its kernels created.
clContext = context.getOpenCLContext();
clQueue = clContext.createQueue().register();
Program program = clContext.createProgramFromSourceFiles(
assetManager, "Shaders/MyComputeShader.cl");
program.build();
program.register();
Kernel initKernel = program.createKernel("initParticleData").register();
Kernel updateKernel = program.createKernel("updateParticleData").register();
And enable OpenCL support in the app settings (before starting the app, of course).
settings.setOpenCLSupport(true);
Also, don’t forget to set the mesh mode to points (like I did), or else no particles will show up.
mesh.setMode(Mesh.Mode.Points);
Step 1:
// set up position buffer
FloatBuffer pb = BufferUtils.createVector3Buffer(numberOfParticles);
VertexBuffer buf = mesh.getBuffer(Type.Position);
if (buf != null) {
buf.updateData(pb);
} else {
buf = new VertexBuffer(Type.Position);
buf.setupData(Usage.Static, 3, Format.Float, pb);
mesh.setBuffer(buf);
}
Step 2:
// bind OpenCL buffer to the position buffer
Buffer clPosBuf = clContext.bindVertexBuffer(
mesh.getBuffer(Type.Position), MemoryAccess.READ_WRITE);
Step 3:
// run initialization kernel
clPosBuf.acquireBufferForSharingNoEvent(clQueue);
initKernel.Run1NoEvent(clQueue, new Kernel.WorkSize(numberOfParticles), clPosBuf);
clPosBuf.releaseBufferForSharingNoEvent(clQueue);
// Note: since the buffer will not be used in the CPU, we don't need an event
Step 4:
// run update kernel
clPosBuf.acquireBufferForSharingNoEvent(clQueue);
updateKernel.Run1NoEvent(clQueue, new Kernel.WorkSize(numberOfParticles), clPosBuf);
clPosBuf.releaseBufferForSharingNoEvent(clQueue);
Pretty simple. Of course, the main logic behind the particles lives in the OpenCL program; the Java here is only supposed to manage and support it. You can check out a sample OpenCL program here.
Taking Advantage of Images
With buffers, I can achieve around 50,000 particles before it gets slow. In order to support millions of particles, images should be used to store the particle data instead of buffers. The tradeoff to using images is that OpenCL is unable to read and write to the same image during the same call.
To get around this, I’m employing a technique called “ping-ponging,” where a kernel reads from image1 and writes to image2 on even frames, then reads from image2 and writes to image1 on odd frames, so it “ping-pongs” back and forth. Of course, this doesn’t have to be done for every image; only the ones that change based on their current state.
If you want to see how ping-ponging works in code, I’ve developed a class to help manage ping-ponged images that you can go over.
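The index-swapping idea itself is tiny. Here is a minimal generic sketch of it (an illustration only, not the library's actual helper class):

```java
// Holds two resources (e.g. two images) whose read/write roles
// swap every frame.
public class PingPong<T> {
    private final T a, b;
    private boolean even = true;

    public PingPong(T first, T second) { a = first; b = second; }

    /** Resource the kernel should read from this frame. */
    public T read()  { return even ? a : b; }

    /** Resource the kernel should write to this frame. */
    public T write() { return even ? b : a; }

    /** Call once per frame, after the update kernel has run. */
    public void swap() { even = !even; }
}
```

Each frame you acquire both images, pass read() and write() to the update kernel as its source and destination, release them, and then call swap().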
Also, the vertex shader must be modified to read positions from a texture instead of a buffer, and to account for ping-ponging. Here’s an example of a vertex shader that does that.
Caution: Icy Road
I’ve had small mistakes here and there freeze up my application, so I recommend unlocking your cursor when working with OpenCL so you can terminate the application via the SDK if that happens. I’ve had to restart my PC many times because a frozen application had locked up my cursor.
flyCam.setEnabled(false);
inputManager.setCursorVisible(true);