[Pull Request - Merged] OpenCL for jME3

shamanDevel · April 27, 2016, 8:34am

Hi folks,
I saw many attempts to bring the power of OpenCL to jME3. All of them required direct interaction with the underlying renderer implementation. Therefore, I created a wrapper around the OpenCL API to decouple it from the renderer.

https://github.com/shamanDevel/jmonkeyengine/tree/OpenCL

The following diagram outlines the structure of the api:

The central object is the Context. The creation of all other objects like kernels, buffers and images is controlled by the context. The context instance is obtained by JmeContext.getOpenCLContext().
From there on, all OpenCL calls are encapsulated in a typesafe class structure.
All classes are placed in the package com.jme3.opencl.

Example usage:

Context context = yourJmeContext.getOpenCLContext(); //acquire the context
CommandQueue queue = context.createQueue(); //create a command queue
Program program = context.createProgramFromSourceFiles(assetManager, "OpenCLTest.cl"); //load a program from sources
program.build(); //build the program
Kernel kernel = program.createKernel("TestKernel"); //create the kernel
Buffer buffer = context.createBuffer(1024); //create a buffer with 1024 bytes
kernel.Run1(queue, new Kernel.WorkSize(1024), buffer, 512, 0.25f); //Call a kernel with three arguments: a buffer, an int and a float

As you can see from the example, calling kernels is especially easy due to the use of var-arg methods.

This API would not be of much use if it didn’t integrate into the existing jME system.
Therefore, an integral part is the interoperability between OpenCL and jME:

Buffer clBuf = context.bindVertexBuffer(vertexBuffer, MemoryAccess.READ_WRITE); //use a vertex buffer as an OpenCL buffer
Image clImg1 = context.bindImage(jmeTexture, MemoryAccess.READ_WRITE); //use a texture as an OpenCL image
//... and more methods

This allows e.g. to:

modify meshes: particle systems, morphing, animation, mesh deformation
modify textures: dynamic textures, light maps
access the renderbuffer: post-process effects, compute the overall luminance for tone mapping
… whatever you like

I created two test classes in jme3test.opencl showing the interoperation.

A note to the design decisions taken:
Unlike the OpenGL renderer, I did not encapsulate the OpenCL calls in a single CL wrapper class with the logic implemented directly in the API classes. Instead, the classes are all interfaces or abstract classes, and the actual implementation is handled by the renderer implementation (currently only lwjgl). There are several reasons for that: the classes are now very lightweight and only hold one pointer to the OpenCL object, and no CL wrapper instance has to be passed around. Furthermore, the underlying native bindings are very different: lwjgl has special classes for every OpenCL object, while lwjgl3 only passes long values around. Further, lwjgl requires a special PointerBuffer for size parameters. Also, the handling of error codes and callback functions is not uniform between the different bindings. This would make it very painful to introduce a single CL wrapper class. I found it simpler to implement the logic in subclasses that can adapt to the quirks of the different bindings.
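
To make the design above concrete, here is a minimal, self-contained sketch of the idea. All names in it (BackendSketch, ClBuffer, ClContext, the two fake bindings) are hypothetical and are not the actual classes in com.jme3.opencl; the "native" calls are faked with plain Java so the example runs on its own. It only shows how an abstract API type lets one backend hold a binding-specific wrapper object while another holds just a long handle.

```java
import java.util.HashMap;
import java.util.Map;

public class BackendSketch {
    /** Renderer-independent API type; user code only ever sees this. */
    public abstract static class ClBuffer {
        public abstract long sizeInBytes();
        public abstract void release();
    }

    /** The context is the factory for all other objects. */
    public interface ClContext {
        ClBuffer createBuffer(long sizeInBytes);
    }

    /** Stand-in for a binding that exposes one wrapper object per CL object. */
    static final class FakeMemObject {
        final long size;
        FakeMemObject(long size) { this.size = size; }
        void free() { /* a real binding would release native memory here */ }
    }

    /** Backend A: the buffer holds the binding's wrapper object. */
    static final class ObjectBackedBuffer extends ClBuffer {
        private final FakeMemObject mem;
        ObjectBackedBuffer(long size) { this.mem = new FakeMemObject(size); }
        @Override public long sizeInBytes() { return mem.size; }
        @Override public void release() { mem.free(); }
    }

    /** Stand-in for a binding that only passes raw long handles around. */
    static final class FakeHandleApi {
        private static final Map<Long, Long> SIZES = new HashMap<>();
        private static long next = 1;
        static long create(long size) { SIZES.put(next, size); return next++; }
        static long size(long handle) { return SIZES.get(handle); }
        static void free(long handle) { SIZES.remove(handle); }
    }

    /** Backend B: the buffer holds nothing but the raw handle. */
    static final class HandleBackedBuffer extends ClBuffer {
        private final long handle;
        HandleBackedBuffer(long size) { this.handle = FakeHandleApi.create(size); }
        @Override public long sizeInBytes() { return FakeHandleApi.size(handle); }
        @Override public void release() { FakeHandleApi.free(handle); }
    }

    public static ClContext objectBackend() { return ObjectBackedBuffer::new; }
    public static ClContext handleBackend() { return HandleBackedBuffer::new; }
}
```

User code written against ClContext and ClBuffer does not change when the backend is swapped, which is the point of keeping the binding quirks inside the subclasses.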

At the moment, the following questions are still open:

Only OpenCL 1.2 is supported; the additional functions and types introduced in OpenCL 2.0 / 2.1 are not included yet
Memory handling: I implemented a system similar to the NativeObjectManager to release unused CL objects. However, Event objects in particular are very small and are collected by the GC very late. Therefore, I have to call System.gc() periodically to release these objects so that I do not run out of native memory. This, however, leads to a huge performance penalty that has to be fixed
The following ideas might fix this:
Extreme way: no automatic releasing, the user has to release every object manually
Only event objects are created so frequently and most of them are not used at all
→ Provide alternative versions of kernel launches, resource request, memory copies, etc. that do not return an event object but release them immediately

Next steps:

Provide the implementation for lwjgl3 and jogl
Cache system for programs similar to the cache system of PyOpenCL
https://documen.tician.de/pyopencl/runtime.html#pyopencl.Program.build
Automatic detection and resolving of #include statements in kernel source code
library of often used functions (I already have them, I just need to port them from C++ to this API):
- BLAS
- 4x4Matrix + Quaternion math
- simple random numbers
- sorting (radix sort + bitonic sort)
Real-world examples
- particle systems
- grid based fluids for smoke, clouds, wind blowing around houses
- particle based fluids (SPH) for water
- …

Any suggestions or ideas?
Then that’s it for now.

Shaman

thetoucher · April 27, 2016, 8:44am

top work mate!

Ali_RS · April 27, 2016, 9:21am

Marvelous

Empire_Phoenix · April 27, 2016, 11:40am

Nice, this is great. I kinda look forward to seeing this as a pull request to main jME in a while
(I still cannot believe that we are nearly state of the art by now; with PBR and tessellation, this might close one of the last larger gaps)

loopies · April 27, 2016, 1:55pm

Omg lol… another one of those peeps that put couch potatoes like me to shame :D. Looks fantastic! Awesome!

Not related… but on your github, the link pointing to attaack of the gellatinous bloob brings a page telling us the site was hacked from Banngladesh :D. Maybe we should warn its owner. NB: spelling errors made on purpose to avoid them being able to google to get the visitors (well me lol).

shamanDevel · May 1, 2016, 4:58pm

Update: with the latest commit (a26e526945aba00220a99fb0d734b04257b3163b), I added a binding to Jogamp’s Jocl.
This implementation is highly experimental. Jocl only supports OpenCL 1.1, so some methods will result in an UnsupportedOperationException (including all image types except 2D and 3D and buffer/image filling).

Roadmap:

Alternative gc handling
LWJGL3 binding
Program cache and include resolution
Pull request!
OpenCL libraries (BLAS, Sorting, Fluid simulation) as an external plugin

Ali_RS · May 1, 2016, 6:38pm

Thanks so much.

I searched the web for Jocl and found this amazing cloth simulation demo, which is implemented in Java.
http://www.jocl.org/cloth/cloth.html

thought to share a video.

Empire_Phoenix · May 1, 2016, 11:04pm

@Sploreg you are the right one, or?

shamanDevel · May 2, 2016, 9:50am

Nice! Maybe we see this soon in jME

shamanDevel · May 2, 2016, 2:23pm

Next update: with commit (54113f35e048dc27653ca9a1c84fea1ac48ee069) I reworked the way the native objects are handled:

All native OpenCL Objects are no longer automatically added to a native object manager (like it’s done in the core for images, meshes, …)
You now have two choices:
- manually free the object with release(). This is the preferred way if you know exactly when you don’t need the resource anymore, as it is the fastest one
- add it to the object manager with register(). The object manager uses a ReferenceQueue (I copied much from NativeObjectManager) to detect when a native object is unreachable and releases it
If neither of the two ways above is used (e.g. you forgot to call register()), the object will be deleted at some point in the future because release() is also called in the finalize() method. This, however, might take a while.
For frequently used methods (acquiring shared resources and kernel launches), I provided alternative versions that do not return an event object. Since the actions are all executed in order in a command queue, intermediate events are often not needed. This saves an object allocation.
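
The two release paths described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the actual jME code: ReleaseSketch, NativeObject, Manager and FREED are hypothetical names, register() takes the manager as an explicit argument here, "freeing" merely records the handle, and the finalize() fallback is left out.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class ReleaseSketch {
    /** Handles that were "freed"; a real backend would call into OpenCL instead. */
    public static final Set<Long> FREED = ConcurrentHashMap.newKeySet();

    /** Frees one handle at most once. Kept separate from the owner object so
     *  that the manager never keeps the owner itself reachable. */
    static final class Releaser implements Runnable {
        private final long handle;
        private final AtomicBoolean done = new AtomicBoolean();
        Releaser(long handle) { this.handle = handle; }
        @Override public void run() {
            if (done.compareAndSet(false, true)) {
                FREED.add(handle);
            }
        }
    }

    public static class NativeObject {
        final Releaser releaser;
        public NativeObject(long handle) { this.releaser = new Releaser(handle); }
        /** Path 1: manual and immediate release. */
        public void release() { releaser.run(); }
        /** Path 2: hand the object over to the manager. */
        public void register(Manager manager) { manager.track(this); }
    }

    public static final class Manager {
        private final ReferenceQueue<NativeObject> queue = new ReferenceQueue<>();
        private final Set<Tracked> live = ConcurrentHashMap.newKeySet();

        private static final class Tracked extends PhantomReference<NativeObject> {
            final Releaser releaser;
            Tracked(NativeObject obj, ReferenceQueue<NativeObject> q) {
                super(obj, q);
                this.releaser = obj.releaser;
            }
        }

        void track(NativeObject obj) { live.add(new Tracked(obj, queue)); }

        /** Call regularly (e.g. once per frame): frees every registered
         *  object that the GC has found to be unreachable. */
        public int deleteUnused() {
            int count = 0;
            Reference<? extends NativeObject> ref;
            while ((ref = queue.poll()) != null) {
                Tracked tracked = (Tracked) ref;
                live.remove(tracked);
                tracked.releaser.run();
                count++;
            }
            return count;
        }
    }
}
```

Because the releaser is idempotent, calling release() manually on an object that was also registered is harmless in this sketch; the manager simply finds nothing left to do for it later.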

In total, these changes improve the performance of e.g. TestVertexBufferSharing from around 220FPS to 280FPS and the memory footprint is much lower.

shamanDevel · May 4, 2016, 1:56pm

And another update: program caching!
I added methods to retrieve the binary code of a program and to load the program from these binaries. The class ProgramCache uses this functionality to cache programs on the hard drive. Applications that load their programs through this cache start up faster after the first run, because a huge chunk of source code generation and compilation can be skipped.
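
As a rough illustration of the caching idea (not the actual ProgramCache implementation), the sketch below stores "binaries" on disk under a key derived from the device name and the source text. The class name, the key scheme, the ".bin" suffix and the compiler callback are all assumptions made for this example; the device is part of the key because OpenCL program binaries are device-specific.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Function;

public class ProgramCacheSketch {
    private final Path dir;

    public ProgramCacheSketch(Path dir) { this.dir = dir; }

    /** Cache key: SHA-256 over the device identity and the full source text. */
    static String key(String device, String source) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(device.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between the two inputs
            md.update(source.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every JVM
        }
    }

    /** Returns the cached binary, or compiles, stores and returns it. */
    public byte[] load(String device, String source, Function<String, byte[]> compiler) {
        Path file = dir.resolve(key(device, source) + ".bin");
        try {
            if (Files.isRegularFile(file)) {
                return Files.readAllBytes(file); // cache hit: compilation is skipped
            }
            byte[] binary = compiler.apply(source); // cache miss: compile once
            Files.createDirectories(dir);
            Files.write(file, binary);
            return binary;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Any change to the source text or the device produces a different key, so stale binaries are never loaded; they are simply left behind in the cache directory.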

Currently, only LWJGL allows retrieving the binaries.
Not JOCL: Jocl provides two APIs: low-level and high-level. The high-level API would provide the required functions, but the internals are just catastrophic: for every instance they allocate a new buffer instead of reusing or sharing old ones! Another example: before you can read an OpenCL buffer to the host, you first have to create an OpenCL buffer that wraps the host buffer by using its host pointer. Very inefficient. So no high-level API.
So I used the low-level API. Everything has worked out until now. But the binding for retrieving the program binaries requires a buffer with pointers to ByteBuffers. OK, Jocl provides a utility class for that, InternalBufferUtil, but it is package-private. Next try: can I just create the high-level wrapper around the program object? Answer: no, the constructor taking only the id of the program is private! Conclusion: retrieving the binaries is not implemented in the Jocl implementation for now.

shamanDevel · May 11, 2016, 8:06am

Hi,
one week has passed and a lot has changed:

Support for lwjgl3
Sadly, everything works except for the program cache, which still throws a segfault. Does anyone have an idea what the problem is? This is the problematic class: https://github.com/shamanDevel/jmonkeyengine/blob/OpenCL/jme3-lwjgl3/src/main/java/com/jme3/opencl/lwjgl/LwjglProgram.java
Include resolution: Similar to shader code, when you load a program that contains an #import statement, the linked file is automatically loaded and inserted. This makes it easy to support libraries that are spread over multiple projects
Added support for Matrix3f and Matrix4f as kernel arguments (will be mapped to a float16)
Added three example libraries. These libraries can simply be included with the include resolution mechanism described above
- Random numbers: a parallel port of java.util.Random to OpenCL → Common/OpenCL/Random.clh
- 3x3 matrix: a port of com.jme3.math.Matrix3f to OpenCL → Common/OpenCL/Matrix3f.clh
- 4x4 matrix: a port of com.jme3.math.Matrix4f to OpenCL → Common/OpenCL/Matrix4f.clh
Added a test class for these libraries
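
The include resolution mentioned in the list above can be pictured with the following minimal sketch. It is not the code from the wrapper: the class name, the exact #import syntax accepted by the regular expression, and the rule that each file is inlined only once are assumptions for illustration, and sources come from an in-memory map instead of the asset manager.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImportResolverSketch {
    // Assumed syntax: a line consisting only of  #import "path/to/file"
    private static final Pattern IMPORT =
            Pattern.compile("^\\s*#import\\s+\"([^\"]+)\"\\s*$");

    /** Returns the source of {@code name} with all imports inlined. */
    public static String resolve(String name, Map<String, String> files) {
        StringBuilder out = new StringBuilder();
        resolve(name, files, new HashSet<>(), out);
        return out.toString();
    }

    private static void resolve(String name, Map<String, String> files,
                                Set<String> seen, StringBuilder out) {
        if (!seen.add(name)) {
            return; // already inlined once; this also breaks import cycles
        }
        String source = files.get(name);
        if (source == null) {
            throw new IllegalArgumentException("missing source: " + name);
        }
        for (String line : source.split("\n")) {
            Matcher m = IMPORT.matcher(line);
            if (m.matches()) {
                resolve(m.group(1), files, seen, out); // paste the imported file here
            } else {
                out.append(line).append('\n');
            }
        }
    }
}
```

With a map containing a kernel file that imports a library file, resolve() returns the library text followed by the kernel text, which is what would then be handed to the OpenCL compiler.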

The main changes to the core are now done, the code is stable and working. “Only” bugfixes are left, including the missing support for the program cache in jocl and lwjgl3. If you use the ProgramCache class and use it as documented, then you won’t see any of these issues.

I would appreciate it if you could help test the OpenCL wrapper. And if you already have a use for it, start using it! To simplify this, I’ve now created a Pull Request.

gouessej · May 27, 2016, 11:39am

Just in case it’s not crystal clear, the JogAmp’s JOCL backend is fully functional and you should be less peremptory about JogAmp’s JOCL especially if you expect some help from our community.

shamanDevel · July 10, 2016, 7:53am

Hi,
some time has passed and I’ve been working with the wrapper quite a bit.
I found three features that were missing and added them in pull request #527

The changes are:

missing toString() methods, useful for debugging
added getter method for the device in the command queue, necessary e.g. to query the work group size
the register() methods for the automatic garbage collection of OpenCL objects now return this. This is important so that e.g. Buffer b = clContext.createBuffer(1).register(); can be written in one line
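
One way to get such a chainable register() in Java is a covariant return type, sketched below. The class names are hypothetical stand-ins, not the wrapper's real classes, and register() only sets a flag here instead of talking to an object manager.

```java
public class FluentRegisterSketch {
    /** Hypothetical base class of all OpenCL objects. */
    public abstract static class ClObject {
        private boolean registered;

        /** Marks the object for automatic release and returns it for chaining. */
        public ClObject register() {
            registered = true;
            return this;
        }

        public boolean isRegistered() { return registered; }
    }

    /** Hypothetical buffer type that narrows the return type of register(). */
    public static class ClBuffer extends ClObject {
        private final long size;

        public ClBuffer(long size) { this.size = size; }

        public long size() { return size; }

        @Override
        public ClBuffer register() { // covariant return: no cast at the call site
            super.register();
            return this;
        }
    }
}
```

With this, ClBuffer b = new ClBuffer(1).register(); compiles without a cast, whereas a base-class method returning only ClObject would force one.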

I’m planning to set up a library project as a plugin that includes sorting, fluids and other tools with OpenCL. I hope that I can post the very first version in the next weeks.

Another question: will the wrapper be part of the next jME release?

barney · March 13, 2018, 1:07pm

Hi,

What is the status of this?

Did it make it into JME3? Did development continue?

Also, you mention integration with JME classes/concepts. Does/did this work provide a means to write shaders in OpenCL?

Thanks
Barney

shamanDevel · March 13, 2018, 10:10pm

Hello,
and greetings from New Zealand (I’m on vacation).

Yes, the OpenCL wrapper is in the 3.2 branch and fully functional.
After the initial merge, it has seen one revision through the issues #694 and #695.

OpenCL can’t replace shaders, but what it can do is transform textures and vertex buffers. Hence you could use it to preprocess meshes (physics, animation, …) or to postprocess the final image (if the effect framework is not sufficient) or whatever else you can think of.

I’ve also demonstrated in Sorting Algorithms with OpenCL that the OpenCL wrapper can also be used for general GPU computing without any graphics context.

If you have any further questions, feel free to ask.

barney · March 14, 2018, 2:52pm

Hi, thanks for the quick response (enjoy the vacation!).

I was under the impression that OpenCL is perfectly capable of replacing GL shaders - I found a code example here (ignore the question, I’m just linking to the code). Or is there some reason that approach might not be performant compared to using GLSL?

When you say “postprocess the final image”, I take it you mean screenspace effects (such as bloom)?

I have written a large number of image processing kernels in OpenCL many of which might be useful as shaders and/or screenspace effects, and would be interested to try some of them out in JME.

Thanks
Barney

danielp · March 14, 2018, 3:16pm

I’m not sure of all the details, but I’ve heard that OpenCL + OpenGL tends to suffer from some inefficiencies that OpenGL compute shaders do not have. Unfortunately jME doesn’t support compute shaders yet, so for the moment we’re restricted to OpenCL for GPU compute (and from what I’ve heard there are some tasks that OpenCL still fits better than compute shaders since the latter are specialized for computations in a graphics pipeline).

shamanDevel · March 14, 2018, 7:31pm

OpenCL can interact with OpenGL textures and vertex buffers. This means that you can directly use your screen-space effects that read from one texture and write into another. (Exactly like the GLSL effects)
OpenCL, however, can’t replace something like the geometry shader where you process primitives without intermediate memory I/O.
Further, you have to take care of the memory synchronization yourself. Therefore, if the postprocessing effect can equally be implemented in GLSL, this might be more efficient than the OpenCL version.

aegroto · March 8, 2019, 8:20am

Hi! What’s the state of this? I tried using it in my application, but getOpenCLContext() returns null.