First attempt, still a little rigid.
A small code sample showing how to create Win32-style OpenGL 3.x and 4.x rendering contexts that can be attached to panels within a CLR Windows Forms application. Mixing native C/C++ with C++/CLI can be pretty useful for making quick tools with shared code bases.
The basic code is in gl_canvas.h. The download contains a sample which makes 3 OpenGL canvases, two of which share a rendering context. It also uses a Timer to update the drawing of the panels. See the comments for more details.
In my current project I am using the Prefix Sum to help me pack data processed on the GPU into an OpenGL buffer that is used as a position input for instanced drawing. I am doing this because it lets me pack all the data I will use at the start of the buffer and render only the valid, on-screen instances.
The basics of Prefix Sum are covered in the link above, and a simple example of how to implement one in parallel can be found here. A more complex, GPU-efficient explanation can be found on the NVIDIA site. The NVIDIA link uses CUDA but if, like me, you are using OpenCL, a version of it can be found in the OpenCL Code Examples in the NVIDIA GPU Computing SDK, under “Scan”.
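As a rough illustration of the two flavours mentioned above, here is a minimal CPU-side sketch in C++. It is not the code from any of the linked articles: the function names are made up for this example, and the "parallel" version only simulates the steps of a Hillis-Steele style scan, where each pass of the outer loop stands in for one step that a GPU would run across all elements at once.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> exclusive_scan_serial(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Hillis-Steele style inclusive scan, simulated on the CPU. Each pass of the
// outer loop represents one parallel step in which every element i would be
// handled by its own work-item; two buffers avoid read/write hazards.
std::vector<int> inclusive_scan_stepwise(const std::vector<int>& in) {
    std::vector<int> src = in;
    std::vector<int> dst(in.size());
    for (std::size_t stride = 1; stride < in.size(); stride *= 2) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            dst[i] = (i >= stride) ? src[i] + src[i - stride] : src[i];
        }
        src.swap(dst);
    }
    return src;
}
```

For the input {3, 1, 7, 0, 4, 1, 6, 3} the serial version gives {0, 3, 4, 11, 11, 15, 16, 22} and the stepwise version gives the inclusive result {3, 4, 11, 11, 15, 16, 22, 25} after three passes. The stepwise form does more additions in total than the work-efficient scan described on the NVIDIA site, but it is the easier one to follow.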
Packing of the data is covered here, though I think the explanation is a little unclear: “if P[i]==P[i+1]” should be “if P[i]!=P[i+1]”, shouldn’t it? In fact I think there are some mistakes in the link from the same site mentioned above too, but it covers the basics in an easy-to-understand manner.
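To make the P[i] != P[i+1] condition concrete, here is a small CPU-side sketch of the packing step in C++. It assumes P is an exclusive prefix sum over 0/1 validity flags with one extra entry at the end holding the total; the function name and the flag layout are hypothetical, chosen only for this example, and on the GPU the second loop would be one work-item per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Packs the entries of `data` whose flag is 1 to the front of the result.
// `flags` holds 1 for a valid entry and 0 otherwise (assumed layout).
std::vector<float> pack_valid(const std::vector<float>& data,
                              const std::vector<int>& flags) {
    assert(data.size() == flags.size());
    const std::size_t n = data.size();

    // P has n + 1 entries: P[i] counts the valid entries before index i,
    // and P[n] is the total number of valid entries.
    std::vector<int> P(n + 1, 0);
    for (std::size_t i = 0; i < n; ++i) {
        P[i + 1] = P[i] + flags[i];
    }

    std::vector<float> packed(static_cast<std::size_t>(P[n]));
    for (std::size_t i = 0; i < n; ++i) {
        // The scan steps up exactly where entry i is valid, so the test is
        // P[i] != P[i + 1], and P[i] is then the destination slot.
        if (P[i] != P[i + 1]) {
            packed[static_cast<std::size_t>(P[i])] = data[i];
        }
    }
    return packed;
}
```

With data {10, 20, 30, 40, 50} and flags {0, 1, 1, 0, 1}, P becomes {0, 0, 1, 2, 2, 3} and the packed result is {20, 30, 50}. The last entry of P is also the instance count that has to reach the draw call.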
After processing the data on the GPU, counting the number of valid entries, and packing the data, the number of instances I need to render must be passed to OpenGL. Currently I read the number back to the CPU and then pass it to OpenGL via glDrawElementsInstanced, though it seems the need to read back the data is removed by glDrawElementsIndirect in GL4 or the GL_ARB_draw_indirect extension in GL3.1, which I will try out later (along with moving my instanced data from a Texture Buffer to a Vertex one, see: GL_ARB_instanced_arrays).
Finally, I read this interesting article today on a newer extension for drawing using data on the GPU: http://rastergrid.com/blog/2011/06/multi-draw-indirect-is-here/
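For reference, the indirect path works by having glDrawElementsIndirect read its parameters from a small struct in a buffer bound to GL_DRAW_INDIRECT_BUFFER, so the instance count can be written by the GPU and never touch the CPU. The sketch below only mirrors that struct in plain C++ (field names follow the GL_ARB_draw_indirect extension text); the helper function and the offset constant are hypothetical names for this example, and the actual GL buffer creation, binding and draw call are left out.

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors the five 32-bit fields that glDrawElementsIndirect reads from the
// buffer bound to GL_DRAW_INDIRECT_BUFFER.
struct DrawElementsIndirectCommand {
    std::uint32_t count;              // number of indices per instance
    std::uint32_t primCount;          // number of instances to draw
    std::uint32_t firstIndex;         // starting offset into the index buffer
    std::int32_t  baseVertex;         // value added to every index
    std::uint32_t reservedMustBeZero; // reserved in GL 4.0
};

static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "expected five tightly packed 32-bit fields");

// Hypothetical helper: fills in the fields the CPU already knows and leaves
// primCount at zero for a GPU pass to overwrite.
DrawElementsIndirectCommand make_command(std::uint32_t indexCount) {
    DrawElementsIndirectCommand cmd = {};
    cmd.count = indexCount;
    return cmd;
}

// Byte offset at which a GPU kernel would store the valid-instance count.
constexpr std::size_t kPrimCountOffset =
    offsetof(DrawElementsIndirectCommand, primCount);
```

The idea would be to upload the result of make_command once, then have the packing pass write its final count at kPrimCountOffset in that buffer before the draw is issued. I have not tried this yet, so treat it as a sketch of the data layout rather than working rendering code.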
I have been super busy recently with various things: including but not limited to a trip home for the first time in 6 years. But that doesn’t mean I haven’t had time for a little home coding.
I had a quick go at porting Ericsson’s ETC-Pack to use OpenCL. It really was a rushed hack job to see what kind of speed-ups I could expect, and as such I have only ported the SLOW option when converting from .ppm to .ktx file formats. The results were quite good.
It takes 112.74 seconds to convert the supplied “elina.ktx” sample on my i7 920 using the standard version. Using the OpenCL version it takes 24.1 seconds on the same CPU and on a GTX 460 it comes down to only 6.0 seconds, roughly 18.5 times faster than the original.
I have attached the CL kernel, which needs to be run in void compressImageFile(…) in place of the 2D loop near the bottom, something like this:
void compressImageFile(uint8 *img,int width,int height,char *dstfile, int expandedwidth, int expandedheight, int action)
{
// original source here ...
int countblocks = 0;
#ifdef TRY_CL //new openCL stuff
//--------------------------------------
//OPENCL STUFF
//--------------------------------------
cl_int error;
cl_mem src = clCreateBuffer(m_context.m_handle, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 3 * width * height, img, &error);
int writesize = ((expandedheight/4) * (expandedwidth/4) * 2);
cl_mem dst = clCreateBuffer(m_context.m_handle, CL_MEM_WRITE_ONLY, sizeof(unsigned int) * writesize, NULL, &error);
unsigned int dimx = expandedwidth/4;
unsigned int dimy = expandedheight/4;
// In the next step we associate the GPU memory with the Kernel arguments
error = clSetKernelArg(m_kernel, 0, sizeof(cl_mem), (void*)&(src));
error |= clSetKernelArg(m_kernel, 1, sizeof(cl_int), (void*)&expandedwidth);
error |= clSetKernelArg(m_kernel, 2, sizeof(cl_int), (void*)&expandedheight);
error |= clSetKernelArg(m_kernel, 3, sizeof(cl_int), (void*)&dimx);
error |= clSetKernelArg(m_kernel, 4, sizeof(cl_int), (void*)&dimy);
error |= clSetKernelArg(m_kernel, 5, sizeof(cl_mem), (void*)&(dst)); // |= so earlier failures are not overwritten
m_workSize[0] = expandedheight / 4;
m_workSize[1] = expandedwidth / 4;
error |= clEnqueueNDRangeKernel(m_context.m_cmdQueue, m_kernel, 2, NULL, m_workSize, NULL, 0, NULL, NULL);
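unsigned int* op = new unsigned int[writesize];
// Blocking read (CL_TRUE): waits for the kernel to finish before copying the result back
error |= clEnqueueReadBuffer(m_context.m_cmdQueue, dst, CL_TRUE, 0, sizeof(unsigned int) * writesize, op, 0, NULL, NULL);
// Each 4x4 block is two 32-bit words, written out in order
for(y=0;y<writesize;y+=2)
{
write_big_endian_4byte_word(&op[y], f);
write_big_endian_4byte_word(&op[y+1], f);
}
// Free the host copy and the device buffers once the blocks are written
delete[] op;
clReleaseMemObject(src);
clReleaseMemObject(dst);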
#else //original code
/// xxx
for(y=0;y<expandedheight/4;y++)
{
for(x=0;x<expandedwidth/4;x++)
{
//cut for brevity
}
}
#endif
fclose(f);
//you get the idea
}
The kernel: compress.txt
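The buffer sizes in the host code follow from the block layout: every 4x4 pixel block compresses to two 32-bit words, and the read-back loop writes them to the file in order. The C++ sketch below spells out that arithmetic. The function names are hypothetical, the round-up-to-4 rule is my assumption about how expandedwidth and expandedheight are derived, and the row-major block order is assumed from the original nested loop rather than taken from the kernel.

```cpp
// Rounds an image dimension up to the next multiple of 4, the block size.
// (Assumed to be how expandedwidth / expandedheight are derived.)
constexpr int expand_to_block(int size) {
    return ((size + 3) / 4) * 4;
}

// Number of 32-bit words the dst buffer must hold: two per 4x4 block.
// Matches the writesize computation in the host code.
constexpr int output_words(int expandedwidth, int expandedheight) {
    return (expandedheight / 4) * (expandedwidth / 4) * 2;
}

// Index of the first output word for block (bx, by), assuming blocks are
// stored row by row like the original y-outer, x-inner loop writes them.
constexpr int block_word_offset(int bx, int by, int dimx) {
    return (by * dimx + bx) * 2;
}
```

For example, a 13x8 image would expand to 16x8, giving 4x2 blocks and 16 output words, and block (3, 1) would start at word 14. Note that the host code puts the block rows in m_workSize[0], so a kernel following this layout would presumably treat global id 0 as by and global id 1 as bx.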
If you are going to run it on the GPU in its current state, you’ll probably need to change your Timeout Detection and Recovery (TDR) settings.
I am trying to make the tools/script for Kinect to Blender animations that I wrote last week a little easier to use. Currently it can anchor a bone to the origin and change the scale of the exported bones (though I still need to add the controls to the form to let the user access those functions). I will also let you set the file name and location of the exported tools. When it’s done I’ll probably put the source for it up here.
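Until the source is up, here is roughly what anchoring and scaling amount to, as a minimal C++ sketch. It is not the tool's actual code: the Vec3 type, the function name and the idea of treating a frame as a flat list of joint positions are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Hypothetical helper: moves the chosen joint to the origin by subtracting
// its position from every joint, then applies a uniform scale.
std::vector<Vec3> anchor_and_scale(const std::vector<Vec3>& joints,
                                   std::size_t anchorIndex,
                                   float scale) {
    assert(anchorIndex < joints.size());
    const Vec3 anchor = joints[anchorIndex];

    std::vector<Vec3> out;
    out.reserve(joints.size());
    for (const Vec3& j : joints) {
        out.push_back({(j.x - anchor.x) * scale,
                       (j.y - anchor.y) * scale,
                       (j.z - anchor.z) * scale});
    }
    return out;
}
```

Applied per frame, this keeps the anchored bone fixed at (0, 0, 0) while the rest of the skeleton moves relative to it.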


