1、算法簡述
實現矩陣相加:Cn = An + Bn。這個例子雖然很簡單,但是由于矩陣元素之間相互獨立,每個元素可以非常容易地進行并行計算,可以非常理想地在OpenCL中實現。
2. C/C++實現
- /*?
- ?*?This?confidential?and?proprietary?software?may?be?used?only?as?
- ?*?authorised?by?a?licensing?agreement?from?ARM?Limited?
- ?*????(C)?COPYRIGHT?2013?ARM?Limited?
- ?*????????ALL?RIGHTS?RESERVED?
- ?*?The?entire?notice?above?must?be?reproduced?on?all?authorised?
- ?*?copies?and?copies?may?only?be?made?to?the?extent?permitted?
- ?*?by?a?licensing?agreement?from?ARM?Limited.?
- ?*/??
- ??
- #include?<iostream>??
- ??
- using?namespace?std;??
- ??
- /**?
- ?*?\brief?Basic?integer?array?addition?implemented?in?C/C++.?
- ?*?\details?A?sample?which?shows?how?to?add?two?integer?arrays?and?store?the?result?in?a?third?array.?
- ?*??????????No?OpenCL?code?is?used?in?this?sample,?only?standard?C/C++.?The?code?executes?only?on?the?CPU.?
- ?*?\return?The?exit?code?of?the?application,?non-zero?if?a?problem?occurred.?
- ?*/??
- int?main(void)??
- {??
- ????/*?[Setup?memory]?*/??
- ????/*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
- ????int?arraySize?=?1000000;??
- ??
- ????/*?Arrays?to?hold?the?input?and?output?data.?*/??
- ????int*?inputA?=?new?int[arraySize];??
- ????int*?inputB?=?new?int[arraySize];??
- ????int*?output?=?new?int[arraySize];??
- ????/*?[Setup?memory]?*/??
- ??
- ????/*?Fill?the?arrays?with?data.?*/??
- ????for?(int?i?=?0;?i?<?arraySize;?i++)??
- ????{??
- ????????inputA[i]?=?i;??
- ????????inputB[i]?=?i;??
- ????}??
- ??
- ????/*?[C/C++?Implementation]?*/??
- ????for?(int?i?=?0;?i?<?arraySize;?i++)??
- ????{??
- ????????output[i]?=?inputA[i]?+?inputB[i];??
- ????}??
- ????/*?[C/C++?Implementation]?*/??
- ??
- ????/*?Uncomment?the?following?block?to?print?results.?*/??
- ????/*?
- ????for?(int?i?=?0;?i?<?arraySize;?i++)?
- ????{?
- ????????cout?<<?"i?=?"?<<?i?<<?",?output?=?"?<<??output[i]?<<?"\n";?
- ????}?
- ????*/??
- ??
- ????delete[]?inputA;??
- ????delete[]?inputB;??
- ????delete[]?output;??
- }??
3 Open基本實現
3.1 內核代碼實現
內核代碼的實現如下,其中指針的修飾符restrict是C99中的關鍵字,只用于限定指針。該關鍵字用于告知編譯器,所有修改該指針所指向內容的操作全部都是基于該指針的,即不存在其它進行修改操作的途徑;這樣的后果是幫助編譯器進行更好的代碼優化,生成更有效率的匯編代碼。
- /*?
- ?*?This?confidential?and?proprietary?software?may?be?used?only?as?
- ?*?authorised?by?a?licensing?agreement?from?ARM?Limited?
- ?*????(C)?COPYRIGHT?2013?ARM?Limited?
- ?*????????ALL?RIGHTS?RESERVED?
- ?*?The?entire?notice?above?must?be?reproduced?on?all?authorised?
- ?*?copies?and?copies?may?only?be?made?to?the?extent?permitted?
- ?*?by?a?licensing?agreement?from?ARM?Limited.?
- ?*/??
- ??
- /**?
- ?*?\brief?Hello?World?kernel?function.?
- ?*?\param[in]?inputA?First?input?array.?
- ?*?\param[in]?inputB?Second?input?array.?
- ?*?\param[out]?output?Output?array.?
- ?*/??
- /*?[OpenCL?Implementation]?*/??
- __kernel?void?hello_world_opencl(__global?int*?restrict?inputA,??
- ?????????????????????????????????__global?int*?restrict?inputB,??
- ?????????????????????????????????__global?int*?restrict?output)??
- {??
- ????/*?
- ?????*?Set?i?to?be?the?ID?of?the?kernel?instance.?
- ?????*?If?the?global?work?size?(set?by?clEnqueueNDRangeKernel)?is?n,?
- ?????*?then?n?kernels?will?be?run?and?i?will?be?in?the?range?[0,?n?-?1].?
- ?????*/??
- ????int?i?=?get_global_id(0);??
- ??
- ????/*?Use?i?as?an?index?into?the?three?arrays.?*/??
- ????output[i]?=?inputA[i]?+?inputB[i];??
- }??
- /*?[OpenCL?Implementation]?*/??
3.2 宿主機代碼實現
內核代碼中并沒有循環語句,只計算一個矩陣元素的值,每一個實例獲得一個獨一無二的所以需要運行的內核實例數目等同于矩陣元素個數。
- /*?
- ????*?Each?instance?of?our?OpenCL?kernel?operates?on?a?single?element?of?each?array?so?the?number?of?
- ????*?instances?needed?is?the?number?of?elements?in?the?array.?
- ????*/??
- ???size_t?globalWorksize[1]?=?{arraySize};??
- ???/*?Enqueue?the?kernel?*/??
- ???if?(!checkSuccess(clEnqueueNDRangeKernel(commandQueue,?kernel,?1,?NULL,?globalWorksize,?NULL,?0,?NULL,?&event)))??
- ???{??
- ???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???????cerr?<<?"Failed?enqueuing?the?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???????return?1;??
- ???}??
因為我們并沒有設置內核間的依賴性,OpenCL設備可以用并行的方式自由地運行內核實例。現在并行化上的唯一限制是設備的容量。在前面的代碼運行之前,需要建立OpenCL,下面分別介紹與建立OpenCL相關的各項內容。
因為現在的操作是在GPU而不是CPU中,我們需要知道任何使用數據的位置。知道數據是在GPU內存空間還是CPU內存空間是非常重要的。在桌面系統中,GPU和CPU有它們自己的內存空間,被相對低速率的總線分開,這意味著在GPU和CPU之間共享數據是一個代價高昂的操作。在大多數帶Mali-T600系列GPU的嵌入式系統中,GPU和CPU共享同一個內存,因此這使得以相對低的代價共享GPU和CPU之間內存成為可能。
由于這些系統的差異,OpenCL支持多種分配和共享設備間內存的方式。下面是一種共享設備間內存的方式,目的是減少從一個設備到另一個設備的內存拷貝(在一個共享內存系統中)。
a. 要求OpenCL設備分配內存
在C/C++實現中,我們使用數組來分配內存。
- /*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
- int?arraySize?=?1000000;??
- /*?Arrays?to?hold?the?input?and?output?data.?*/??
- int*?inputA?=?new?int[arraySize];??
- int*?inputB?=?new?int[arraySize];??
- int*?output?=?new?int[arraySize];??
- /*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
- cl_int?arraySize?=?1000000;??
- /*?The?buffers?are?the?size?of?the?arrays.?*/??
- size_t?bufferSize?=?arraySize?*?sizeof(cl_int);??
- /*?
- ?*?Ask?the?OpenCL?implementation?to?allocate?buffers?for?the?data.?
- ?*?We?ask?the?OpenCL?implemenation?to?allocate?memory?rather?than?allocating?
- ?*?it?on?the?CPU?to?avoid?having?to?copy?the?data?later.?
- ?*?The?read/write?flags?relate?to?accesses?to?the?memory?from?within?the?kernel.?
- ?*/??
- bool?createMemoryObjectsSuccess?=?true;??
- memoryObjects[0]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
- createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- memoryObjects[1]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
- createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- memoryObjects[2]?=?clCreateBuffer(context,?CL_MEM_WRITE_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
- createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- if?(!createMemoryObjectsSuccess)??
- {??
- ????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????cerr?<<?"Failed?to?create?OpenCL?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????return?1;??
- }??
b. 映射內存到局部指針
現在內存已分配,但是只有OpenCL實現知道它的位置。為了訪問CPU上的內存,我們把它們映射到一個指針。
- /*?Map?the?memory?buffers?created?by?the?OpenCL?implementation?to?pointers?so?we?can?access?them?on?the?CPU.?*/??
- bool?mapMemoryObjectsSuccess?=?true;??
- cl_int*?inputA?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[0],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
- mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- cl_int*?inputB?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[1],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
- mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- if?(!mapMemoryObjectsSuccess)??
- {??
- ???cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???cerr?<<?"Failed?to?map?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???return?1;??
- }??
現在這些指針可以想普通的C/C++指針那樣使用了。
c. 在CPU上初始化數據
因為我們已有了指向內存的指針,這一步與在CPU上一樣。
- for?(int?i?=?0;?i?<?arraySize;?i++)??
- {??
- ???inputA[i]?=?i;??
- ???inputB[i]?=?i;??
- }??
d. 取消映射緩沖區
為了使OpenCL設備使用緩沖區,我們必須把它們在CPU上的映射取消。
- /*?
- ?*?Unmap?the?memory?objects?as?we?have?finished?using?them?from?the?CPU?side.?
- ?*?We?unmap?the?memory?because?otherwise:?
- ?*?-?reads?and?writes?to?that?memory?from?inside?a?kernel?on?the?OpenCL?side?are?undefined.?
- ?*?-?the?OpenCL?implementation?cannot?free?the?memory?when?it?is?finished.?
- ?*/??
- if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[0],?inputA,?0,?NULL,?NULL)))??
- {??
- ???cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???return?1;??
- }??
- if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[1],?inputB,?0,?NULL,?NULL)))??
- {??
- ???cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???return?1;??
- }??
e. 映射數據到內核
在我們調度內核運行之前,我們必須告訴內核哪些數據作為輸入使用。這里,我們映射內存對象到OpenCL內核函數的參數中。
- bool?setKernelArgumentsSuccess?=?true;??
- setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?0,?sizeof(cl_mem),?&memoryObjects[0]));??
- setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?1,?sizeof(cl_mem),?&memoryObjects[1]));??
- setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?2,?sizeof(cl_mem),?&memoryObjects[2]));??
- if?(!setKernelArgumentsSuccess)??
- {??
- ????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????cerr?<<?"Failed?setting?OpenCL?kernel?arguments.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????return?1;??
- }??
f. 運行內核
對于內核代碼見前面,如何調度它則不作詳述。
g. 獲取運行結果
一旦計算結束,我們像映射輸入緩沖區那樣映射輸出緩沖區。然后,我們就可以使用指針讀取結果數據,然后取消緩沖區映射,就像前面那樣。
基本實現的宿主機的完整代碼如下:
- /*?
- ?*?This?confidential?and?proprietary?software?may?be?used?only?as?
- ?*?authorised?by?a?licensing?agreement?from?ARM?Limited?
- ?*????(C)?COPYRIGHT?2013?ARM?Limited?
- ?*????????ALL?RIGHTS?RESERVED?
- ?*?The?entire?notice?above?must?be?reproduced?on?all?authorised?
- ?*?copies?and?copies?may?only?be?made?to?the?extent?permitted?
- ?*?by?a?licensing?agreement?from?ARM?Limited.?
- ?*/??
- ??
- #include?"common.h"??
- #include?"image.h"??
- ??
- #include?<CL/cl.h>??
- #include?<iostream>??
- ??
- using?namespace?std;??
- ??
- /**?
- ?*?\brief?Basic?integer?array?addition?implemented?in?OpenCL.?
- ?*?\details?A?sample?which?shows?how?to?add?two?integer?arrays?and?store?the?result?in?a?third?array.?
- ?*??????????The?main?calculation?code?is?in?an?OpenCL?kernel?which?is?executed?on?a?GPU?device.?
- ?*?\return?The?exit?code?of?the?application,?non-zero?if?a?problem?occurred.?
- ?*/??
- int?main(void)??
- {??
- ????cl_context?context?=?0;??
- ????cl_command_queue?commandQueue?=?0;??
- ????cl_program?program?=?0;??
- ????cl_device_id?device?=?0;??
- ????cl_kernel?kernel?=?0;??
- ????int?numberOfMemoryObjects?=?3;??
- ????cl_mem?memoryObjects[3]?=?{0,?0,?0};??
- ????cl_int?errorNumber;??
- ??
- ????if?(!createContext(&context))??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?to?create?an?OpenCL?context.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ??
- ????if?(!createCommandQueue(context,?&commandQueue,?&device))??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?to?create?the?OpenCL?command?queue.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ??
- ????if?(!createProgram(context,?device,?"assets/hello_world_opencl.cl",?&program))??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?to?create?OpenCL?program."?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ??
- ????kernel?=?clCreateKernel(program,?"hello_world_opencl",?&errorNumber);??
- ????if?(!checkSuccess(errorNumber))??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?to?create?OpenCL?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ??
- ????/*?[Setup?memory]?*/??
- ????/*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
- ????cl_int?arraySize?=?1000000;??
- ??
- ????/*?The?buffers?are?the?size?of?the?arrays.?*/??
- ????size_t?bufferSize?=?arraySize?*?sizeof(cl_int);??
- ??
- ????/*?
- ?????*?Ask?the?OpenCL?implementation?to?allocate?buffers?for?the?data.?
- ?????*?We?ask?the?OpenCL?implemenation?to?allocate?memory?rather?than?allocating?
- ?????*?it?on?the?CPU?to?avoid?having?to?copy?the?data?later.?
- ?????*?The?read/write?flags?relate?to?accesses?to?the?memory?from?within?the?kernel.?
- ?????*/??
- ????bool?createMemoryObjectsSuccess?=?true;??
- ??
- ????memoryObjects[0]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
- ????createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- ??
- ????memoryObjects[1]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
- ????createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- ??
- ????memoryObjects[2]?=?clCreateBuffer(context,?CL_MEM_WRITE_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
- ????createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- ??
- ????if?(!createMemoryObjectsSuccess)??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?to?create?OpenCL?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ????/*?[Setup?memory]?*/??
- ??
- ????/*?[Map?the?buffers?to?pointers]?*/??
- ????/*?Map?the?memory?buffers?created?by?the?OpenCL?implementation?to?pointers?so?we?can?access?them?on?the?CPU.?*/??
- ????bool?mapMemoryObjectsSuccess?=?true;??
- ??
- ????cl_int*?inputA?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[0],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
- ????mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- ??
- ????cl_int*?inputB?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[1],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
- ????mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
- ??
- ????if?(!mapMemoryObjectsSuccess)??
- ????{??
- ???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???????cerr?<<?"Failed?to?map?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???????return?1;??
- ????}??
- ????/*?[Map?the?buffers?to?pointers]?*/??
- ??
- ????/*?[Initialize?the?input?data]?*/??
- ????for?(int?i?=?0;?i?<?arraySize;?i++)??
- ????{??
- ???????inputA[i]?=?i;??
- ???????inputB[i]?=?i;??
- ????}??
- ????/*?[Initialize?the?input?data]?*/??
- ??
- ????/*?[Un-map?the?buffers]?*/??
- ????/*?
- ?????*?Unmap?the?memory?objects?as?we?have?finished?using?them?from?the?CPU?side.?
- ?????*?We?unmap?the?memory?because?otherwise:?
- ?????*?-?reads?and?writes?to?that?memory?from?inside?a?kernel?on?the?OpenCL?side?are?undefined.?
- ?????*?-?the?OpenCL?implementation?cannot?free?the?memory?when?it?is?finished.?
- ?????*/??
- ????if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[0],?inputA,?0,?NULL,?NULL)))??
- ????{??
- ???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???????cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???????return?1;??
- ????}??
- ??
- ????if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[1],?inputB,?0,?NULL,?NULL)))??
- ????{??
- ???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???????cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???????return?1;??
- ????}??
- ????/*?[Un-map?the?buffers]?*/??
- ??
- ????/*?[Set?the?kernel?arguments]?*/??
- ????bool?setKernelArgumentsSuccess?=?true;??
- ????setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?0,?sizeof(cl_mem),?&memoryObjects[0]));??
- ????setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?1,?sizeof(cl_mem),?&memoryObjects[1]));??
- ????setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?2,?sizeof(cl_mem),?&memoryObjects[2]));??
- ??
- ????if?(!setKernelArgumentsSuccess)??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?setting?OpenCL?kernel?arguments.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ????/*?[Set?the?kernel?arguments]?*/??
- ??
- ????/*?An?event?to?associate?with?the?Kernel.?Allows?us?to?retrieve?profiling?information?later.?*/??
- ????cl_event?event?=?0;??
- ??
- ????/*?[Global?work?size]?*/??
- ????/*?
- ?????*?Each?instance?of?our?OpenCL?kernel?operates?on?a?single?element?of?each?array?so?the?number?of?
- ?????*?instances?needed?is?the?number?of?elements?in?the?array.?
- ?????*/??
- ????size_t?globalWorksize[1]?=?{arraySize};??
- ????/*?Enqueue?the?kernel?*/??
- ????if?(!checkSuccess(clEnqueueNDRangeKernel(commandQueue,?kernel,?1,?NULL,?globalWorksize,?NULL,?0,?NULL,?&event)))??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?enqueuing?the?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ????/*?[Global?work?size]?*/??
- ??
- ????/*?Wait?for?kernel?execution?completion.?*/??
- ????if?(!checkSuccess(clFinish(commandQueue)))??
- ????{??
- ????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????????cerr?<<?"Failed?waiting?for?kernel?execution?to?finish.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????????return?1;??
- ????}??
- ??
- ????/*?Print?the?profiling?information?for?the?event.?*/??
- ????printProfilingInfo(event);??
- ????/*?Release?the?event?object.?*/??
- ????if?(!checkSuccess(clReleaseEvent(event)))??
- ????{??
- ???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???????cerr?<<?"Failed?releasing?the?event?object.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???????return?1;??
- ????}??
- ??
- ????/*?Get?a?pointer?to?the?output?data.?*/??
- ????cl_int*?output?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[2],?CL_TRUE,?CL_MAP_READ,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
- ????if?(!checkSuccess(errorNumber))??
- ????{??
- ???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???????cerr?<<?"Failed?to?map?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???????return?1;??
- ????}??
- ??
- ????/*?[Output?the?results]?*/??
- ????/*?Uncomment?the?following?block?to?print?results.?*/??
- ????/*?
- ????for?(int?i?=?0;?i?<?arraySize;?i++)?
- ????{?
- ????????cout?<<?"i?=?"?<<?i?<<?",?output?=?"?<<??output[i]?<<?"\n";?
- ????}?
- ????*/??
- ????/*?[Output?the?results]?*/??
- ??
- ????/*?Unmap?the?memory?object?as?we?are?finished?using?them?from?the?CPU?side.?*/??
- ????if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[2],?output,?0,?NULL,?NULL)))??
- ????{??
- ???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ???????cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ???????return?1;??
- ????}??
- ??
- ????/*?Release?OpenCL?objects.?*/??
- ????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- }??
4 向量化你的OpenCL代碼
4.1 向量基礎
OpenCL設備可以通告它們為不同數據類型的首選向量寬度,你可以使用這個信息來選擇一個內核。結果是,相當于該內核為你正在運行的平臺做了優化。例如,一個設備可能僅有標量整數的硬件支持,而另一個設備則有寬度為4的整數向量的硬件支持。可以寫兩個版本的內核,一個用于標量,一個用于向量,在運行時選擇正確的版本。
這里是一個在特定設備上詢問首選整數向量寬度的例子。
- /*?
- ?*?Query?the?device?to?find?out?it's?prefered?integer?vector?width.?
- ?*?Although?we?are?only?printing?the?value?here,?it?can?be?used?to?select?between?
- ?*?different?versions?of?a?kernel.?
- ?*/??
- cl_uint?integerVectorWidth;??
- clGetDeviceInfo(device,?CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,?sizeof(cl_uint),?&integerVectorWidth,?NULL);??
- cout?<<?"Prefered?vector?width?for?integers:?"?<<?integerVectorWidth?<<?endl;??
每一個Mali T600系列GPU核最少有兩個128位寬度的ALU(算數邏輯單元),它們具有矢量計算能力。ALU中的絕大多數操作(例如,浮點加,浮點乘,整數加,整數乘),可以以128位向量數據操作(例如,char16, short8, int4, float4)。使用前面講述的詢問方法來為你的數據類型決定使用正確的向量大小。
當使用Mali T600系列GPU時,我們推薦在任何可能的地方使用向量。
4.2 向量化代碼
首先,修改內核代碼以支持向量運算。對于Mali T600系列GPU來說,一個向量運算的時間與一個整數加法的時間是一樣的。具體代碼解讀,見下面代碼中的注釋部分。
- __kernel?void?hello_world_vector(__global?int*?restrict?inputA,??
- ?????????????????????????????????__global?int*?restrict?inputB,??
- ?????????????????????????????????__global?int*?restrict?output)??
- {??
- ????/*?
- ?????*?We?have?reduced?the?global?work?size?(n)?by?a?factor?of?4?compared?to?the?hello_world_opencl?sample.?
- ?????*?Therefore,?i?will?now?be?in?the?range?[0,?(n?/?4)?-?1].?
- ?????*/??
- ????int?i?=?get_global_id(0);??
- ????/*?
- ?????*?Load?4?integers?into?'a'.?
- ?????*?The?offset?calculation?is?implicit?from?the?size?of?the?vector?load.?
- ?????*?For?vloadN(i,?p),?the?address?of?the?first?data?loaded?would?be?p?+?i?*?N.?
- ?????*?Load?from?the?data?from?the?address:?inputA?+?i?*?4.?
- ?????*/??
- ????int4?a?=?vload4(i,?inputA);??
- ????/*?Do?the?same?for?inputB?*/??
- ????int4?b?=?vload4(i,?inputB);??
- ????/*?
- ?????*?Do?the?vector?addition.?
- ?????*?Store?the?result?at?the?address:?output?+?i?*?4.?
- ?????*/??
- ????vstore4(a?+?b,?i,?output);??
- }??
- /*?
- ?*?Each?instance?of?our?OpenCL?kernel?now?operates?on?4?elements?of?each?array?so?the?number?of?
- ?*?instances?needed?is?the?number?of?elements?in?the?array?divided?by?4.?
- ?*/??
- size_t?globalWorksize[1]?=?{arraySize?/?4};??
- /*?Enqueue?the?kernel?*/??
- if?(!checkSuccess(clEnqueueNDRangeKernel(commandQueue,?kernel,?1,?NULL,?globalWorksize,?NULL,?0,?NULL,?&event)))??
- {??
- ????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
- ????cerr?<<?"Failed?enqueuing?the?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
- ????return?1;??
- }??
5 運行OpenCL樣例
(1). 在SDK根目錄的命令行提示符中
- cd?samples\hello_world_vector??
- cs-make?install??
(2) . 拷貝bin文件夾到目標板中
(3). 在板子上導航到該目錄,運行hello world二進制文件
- chmod?777?hello_world_vector??
- ./hello_world_vector?