Simple way to create an OpenCL program

Serg_ · September 29, 2015, 4:03pm

In this topic I describe how to run an OpenCL program. You can also see here how to update data in an OpenCL program.

The code was written against LWJGL’s OpenCL bindings and works in the jMonkeyEngine SDK 3.0 with its default libraries.

First, note that OpenCL works with 1-D arrays, and that data is sent to GPU memory through buffers.

Initialization

Imports:

    import com.jme3.app.Application;
    import com.jme3.math.Vector3f;
    import com.jme3.scene.VertexBuffer;
    import org.lwjgl.BufferUtils;
    import org.lwjgl.PointerBuffer;
    import org.lwjgl.opencl.*;
    import org.lwjgl.opengl.Display;
    import org.lwjgl.opengl.Drawable;
    import java.nio.FloatBuffer;
    import java.util.List;

The following code loads and builds the OpenCL program and selects the devices to work with:

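    // Register the loader for *.cl files and load the kernel source as a String
    // (Application stands for your application instance here).
    Application.getAssetManager().registerLoader(CLLoader.class, "cl");
    String source=(String)Application.getAssetManager().loadAsset("programm.cl");

    // Initialize LWJGL's OpenCL bindings.
    CL.create();

    // Take the first platform and its GPU devices.
    CLPlatform platform=CLPlatform.getPlatforms().get(0);
    List<CLDevice> devices=platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);

    // Create the context and a command queue on the first device.
    CL_CONTEXT=CLContext.create(platform, devices, null);
    CL_QUEUE=CL10.clCreateCommandQueue(CL_CONTEXT, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, null);

    // Compile the source and check for build errors.
    CL_PROGRAM=CL10.clCreateProgramWithSource(CL_CONTEXT,source,null);
    Util.checkCLError(CL10.clBuildProgram(CL_PROGRAM,devices.get(0),"",null));

    // "main" is the name of the kernel function in the .cl file.
    CL_KERNEL=CL10.clCreateKernel(CL_PROGRAM,"main",null);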

CLLoader.class can be found here: OpenCL.rar - Google Drive
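The source of CLLoader is not shown in this post. Assuming it does nothing more than return the text of the .cl file as a String (which is what the cast in the loading code suggests), the core of such a loader might look roughly like the sketch below. KernelSourceReader and readAll are hypothetical names, not the contents of the archive, and the UTF-8 encoding is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a kernel source file into a String.
// This is an illustration only, not the CLLoader from the archive.
final class KernelSourceReader {
    /** Reads the whole stream as UTF-8 text, normalizing line endings to '\n'. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder source = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                source.append(line).append('\n');
            }
        }
        return source.toString();
    }
}
```

Inside a jME3 asset loader, the stream would come from the asset being loaded, and the returned String is what gets handed to clCreateProgramWithSource.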

Converting arrays

OpenCL works with 1-D arrays, which is why multidimensional arrays have to be converted to 1-D (I have seen suggestions to use the “Image” type instead, but that method is more complicated and I don’t cover it here).

To convert a 2-D array [numColumns][numRows] to 1-D I use this order:

1) create a new 1-D array with length equal to numColumns*numRows:

    length=numColumns*numRows;

2) to access an element of the 1-D array, use the index column+numColumns*row; the element at that index of the 1-D array is the element [column][row] of the 2-D array.
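The conversion above can be sketched in plain Java as follows. Flatten2D, flatten and index are hypothetical names chosen for this illustration, and the sketch assumes a non-empty, rectangular array.

```java
// Hypothetical helper illustrating the index formula column + numColumns * row.
final class Flatten2D {
    /** Copies grid[column][row] into a 1-D array at index column + numColumns * row. */
    static float[] flatten(float[][] grid) {
        int numColumns = grid.length;
        int numRows = grid[0].length; // assumes a non-empty, rectangular array
        float[] flat = new float[numColumns * numRows];
        for (int column = 0; column < numColumns; column++) {
            for (int row = 0; row < numRows; row++) {
                flat[column + numColumns * row] = grid[column][row];
            }
        }
        return flat;
    }

    /** Index of grid[column][row] inside the flattened array. */
    static int index(int column, int row, int numColumns) {
        return column + numColumns * row;
    }
}
```

The same index expression can then be used inside the kernel to find the neighbours of an element, as long as numColumns is passed to the kernel as an argument.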

Creating buffers

Array data, unlike single numbers, is sent to GPU memory through buffers, in this order:

1) create the buffer with org.lwjgl.BufferUtils or com.jme3.util.BufferUtils; the jme3 version is simpler and supports Vector3f arrays:

    FloatBuffer Buffer_data=com.jme3.util.BufferUtils.createFloatBuffer(Data);

With jme3 the buffer is created in one line, and a Vector3f array is passed as x1,y1,z1,x2,y2,z2…

2) create a memory object on the GPU from the data that was put in the FloatBuffer:

    CLMem CL_data=CL10.clCreateBuffer(CL_CONTEXT, CL10.CL_MEM_READ_WRITE | CL10.CL_MEM_COPY_HOST_PTR, Buffer_data, null);
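To make the x1,y1,z1,x2,y2,z2… layout concrete, here is a minimal sketch that builds such a buffer with java.nio only. It is not the jme3 implementation; PackVectors and pack are hypothetical names, and each point is represented as a plain {x, y, z} float triple instead of a Vector3f so that the example runs without jme3.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical stand-in showing the packed x,y,z layout without jme3 on the classpath.
final class PackVectors {
    /** Packs points given as {x, y, z} triples into a direct, native-order FloatBuffer. */
    static FloatBuffer pack(float[][] points) {
        FloatBuffer buffer = ByteBuffer
                .allocateDirect(points.length * 3 * 4) // 3 floats per point, 4 bytes per float
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (float[] p : points) {
            buffer.put(p[0]).put(p[1]).put(p[2]);
        }
        buffer.flip(); // position = 0, limit = number of floats written
        return buffer;
    }
}
```

A direct buffer in native byte order is used because the data is handed over to native code.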

To get the result as a new float array, we need to create a buffer whose size in bytes equals the array’s length * 4 (one float is 4 bytes):

    CLMem resultMemory = CL10.clCreateBuffer(CL_CONTEXT, CL10.CL_MEM_READ_WRITE, length*4, null);

Pass the data as the OpenCL program’s arguments:

    CL_KERNEL.setArg(0,CL_data);
    CL_KERNEL.setArg(1,resultMemory);
    CL_KERNEL.setArg(2,length);

Set number of workers

    PointerBuffer CL_WORKER=org.lwjgl.BufferUtils.createPointerBuffer(1); // one entry per dimension (1-D here)
    CL_WORKER.put(0,length);  // number of workers, equal to the number of array elements we work with

OpenCL program

    __kernel void main(
            __global float *data, // do not use float3 for a Vector3f array
            __global float *res,
            int numInter
            ){
        unsigned int i = get_global_id(0); // get the worker id
        // Guard: valid indices are 0..numInter-1 (the number of workers already equals the array size).
        if (i >= numInter) return;
        res[i] = data[i];
    }

Example: interpolating between pairs of neighbouring Vector3f in a Vector3f array (the midpoint of vector i and vector i+1):

    res[i*3]  = (data[i*3+0]+data[i*3+3])/2;
    res[i*3+1]= (data[i*3+1]+data[i*3+4])/2;
    res[i*3+2]= (data[i*3+2]+data[i*3+5])/2;
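For checking what the GPU returns, the same computation can be run on the CPU. The sketch below is a plain Java reference and not part of the original code; MidpointReference and midpoints are hypothetical names. It assumes data holds packed x,y,z triples and skips the last vector, because that one has no successor to average with.

```java
// Hypothetical CPU reference for the midpoint example, useful for checking GPU results.
final class MidpointReference {
    /**
     * data holds packed x,y,z triples. For every vector i that has a successor,
     * res receives the midpoint of vector i and vector i + 1.
     */
    static float[] midpoints(float[] data) {
        float[] res = new float[data.length];
        int numVectors = data.length / 3;
        // Stop one vector early: vector i + 1 must exist, otherwise data[i*3+5] is out of range.
        for (int i = 0; i + 1 < numVectors; i++) {
            res[i * 3]     = (data[i * 3]     + data[i * 3 + 3]) / 2;
            res[i * 3 + 1] = (data[i * 3 + 1] + data[i * 3 + 4]) / 2;
            res[i * 3 + 2] = (data[i * 3 + 2] + data[i * 3 + 5]) / 2;
        }
        return res;
    }
}
```

In the kernel, the loop variable corresponds to the worker id, so the same upper bound can serve as the guard condition there.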

Run the program

    CL10.clEnqueueNDRangeKernel(CL_QUEUE,CL_KERNEL,1,null,CL_WORKER,null,null,null);
    CL10.clFinish(CL_QUEUE);

It is best to do this in the controlUpdate() cycle.

Copy result from GPU memory

    FloatBuffer resultBuff = com.jme3.util.BufferUtils.createFloatBuffer(length);
    // Blocking read (CL_TRUE): returns once resultBuff has been filled.
    CL10.clEnqueueReadBuffer(CL_QUEUE, resultMemory, CL10.CL_TRUE, 0, resultBuff, null, null);
    float result[]=new float[resultBuff.capacity()]; // resultBuff.capacity() equals length
    for(int i = 0; i < resultBuff.capacity(); i++) {
        result[i]=resultBuff.get(i);
    }

For Vector3f:

    for(int i = 0; i < resultBuff.capacity()/3; i++) {
        vector[i].x=resultBuff.get(i*3+0);
        vector[i].y=resultBuff.get(i*3+1);
        vector[i].z=resultBuff.get(i*3+2);
    }
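As an alternative to copying element by element, a FloatBuffer can be emptied into an array with one bulk get call. The sketch below shows this with java.nio only; UnpackResult, toArray and toTriples are hypothetical names, and plain {x, y, z} float triples stand in for Vector3f. It uses remaining() rather than capacity(), which gives the same number for a freshly created buffer.

```java
import java.nio.FloatBuffer;

// Hypothetical helper for turning a result buffer back into Java arrays.
final class UnpackResult {
    /** Copies all remaining floats out of the buffer in one bulk call. */
    static float[] toArray(FloatBuffer buffer) {
        float[] result = new float[buffer.remaining()];
        buffer.duplicate().get(result); // duplicate() leaves the caller's position untouched
        return result;
    }

    /** Splits packed x,y,z floats into one {x, y, z} triple per vector. */
    static float[][] toTriples(float[] packed) {
        float[][] triples = new float[packed.length / 3][3];
        for (int i = 0; i < triples.length; i++) {
            triples[i][0] = packed[i * 3];
            triples[i][1] = packed[i * 3 + 1];
            triples[i][2] = packed[i * 3 + 2];
        }
        return triples;
    }
}
```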

Remark: if the expected size of resultBuff is larger than resultMemory, resultBuff will contain only zeros.

Update data

In the update cycle it is possible to update the data in GPU memory:

    FloatBuffer updata_buffer=com.jme3.util.BufferUtils.createFloatBuffer(data);  // new data
    CL10.clEnqueueWriteBuffer(CL_QUEUE, CL_data, CL10.CL_TRUE, 0, updata_buffer, null, null); // CL_TRUE = blocking write

or copy the data from one CLMem buffer to another:

    CLMem CL_updata=CL10.clCreateBuffer(CL_CONTEXT, CL10.CL_MEM_READ_WRITE |
                                       CL10.CL_MEM_COPY_HOST_PTR, updata_buffer, null);
    // Arguments: source, destination (CL_POINTS), source offset, destination offset, size in bytes.
    CL10.clEnqueueCopyBuffer(CL_QUEUE, CL_updata, CL_POINTS, 0, 0, length*4, null, null);

For single values (1x1 data):

    CL_KERNEL.setArg(3,tpf);

Vertex buffer

OpenCL also gives us the possibility to work with a vertex buffer directly.
To do this, pass the Drawable to the CLContext and wrap the GL buffer in a CLMem object:

    VertexBuffer vertex_buffer=mesh.getBuffer(VertexBuffer.Type.Position);
    Drawable drawable=Display.getDrawable();
    CLContext CL_CONTEXT=CLContext.create(platform, devices, null, drawable, null);
    CLMem CL_VERTICES=CL10GL.clCreateFromGLBuffer(CL_CONTEXT, CL10.CL_MEM_READ_WRITE, vertex_buffer.getId(), null);

In that case the update cycle code is:

    CL10GL.clEnqueueAcquireGLObjects(CL_QUEUE,CL_VERTICES,null,null);
    CL10.clEnqueueNDRangeKernel(CL_QUEUE, CL_KERNEL, 1, null, CL_WORKER, null, null, null);
    CL10GL.clEnqueueReleaseGLObjects(CL_QUEUE, CL_VERTICES, null, null);
    CL10.clFinish(CL_QUEUE);

and there is no need to call CL10.clEnqueueWriteBuffer to update the data.

Compute the vertex positions as a float array:

    __kernel void main(
            __global float *vertices,
            ...

    float3 move=(float3)(x,y,z);
    int id=i*3;
    vertices[id]=vertices[id]+move.x;
    vertices[id+1]=vertices[id+1]+move.y;
    vertices[id+2]=vertices[id+2]+move.z;

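Note: this does not work, most likely because a float3 in OpenCL occupies the space of four floats and so does not match the tightly packed x,y,z data:

        ... __global float3 *vertices
    ...vertices[i].x=vertices[i].x+move.x;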

When the program finishes, we need to release the memory:

    CL10.clReleaseMemObject(CL_data);
    CL10.clReleaseMemObject(resultMemory);
            
    CL10.clReleaseKernel(CL_KERNEL);
    CL10.clReleaseProgram(CL_PROGRAM);
    CL10.clReleaseCommandQueue(CL_QUEUE);
    CL10.clReleaseContext(CL_CONTEXT);
    CL.destroy();

This is enough to begin writing Java programs that do their computation with OpenCL. OpenCL also makes it possible to create user events and to manage Image objects.