1、算法簡述

　　實現矩陣相加：Cn = An + Bn。這個例子雖然很簡單，但是由于矩陣元素之間相互獨立，每個元素可以非常容易地進行并行計算，可以非常理想地在OpenCL中實現。

2. C/C++實現

[cpp] view plaincopyprint?

/*?
?*?This?confidential?and?proprietary?software?may?be?used?only?as?
?*?authorised?by?a?licensing?agreement?from?ARM?Limited?
?*????(C)?COPYRIGHT?2013?ARM?Limited?
?*????????ALL?RIGHTS?RESERVED?
?*?The?entire?notice?above?must?be?reproduced?on?all?authorised?
?*?copies?and?copies?may?only?be?made?to?the?extent?permitted?
?*?by?a?licensing?agreement?from?ARM?Limited.?
?*/??
??
#include?<iostream>??
??
using?namespace?std;??
??
/**?
?*?\brief?Basic?integer?array?addition?implemented?in?C/C++.?
?*?\details?A?sample?which?shows?how?to?add?two?integer?arrays?and?store?the?result?in?a?third?array.?
?*??????????No?OpenCL?code?is?used?in?this?sample,?only?standard?C/C++.?The?code?executes?only?on?the?CPU.?
?*?\return?The?exit?code?of?the?application,?non-zero?if?a?problem?occurred.?
?*/??
int?main(void)??
{??
????/*?[Setup?memory]?*/??
????/*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
????int?arraySize?=?1000000;??
??
????/*?Arrays?to?hold?the?input?and?output?data.?*/??
????int*?inputA?=?new?int[arraySize];??
????int*?inputB?=?new?int[arraySize];??
????int*?output?=?new?int[arraySize];??
????/*?[Setup?memory]?*/??
??
????/*?Fill?the?arrays?with?data.?*/??
????for?(int?i?=?0;?i?<?arraySize;?i++)??
????{??
????????inputA[i]?=?i;??
????????inputB[i]?=?i;??
????}??
??
????/*?[C/C++?Implementation]?*/??
????for?(int?i?=?0;?i?<?arraySize;?i++)??
????{??
????????output[i]?=?inputA[i]?+?inputB[i];??
????}??
????/*?[C/C++?Implementation]?*/??
??
????/*?Uncomment?the?following?block?to?print?results.?*/??
????/*?
????for?(int?i?=?0;?i?<?arraySize;?i++)?
????{?
????????cout?<<?"i?=?"?<<?i?<<?",?output?=?"?<<??output[i]?<<?"\n";?
????}?
????*/??
??
????delete[]?inputA;??
????delete[]?inputB;??
????delete[]?output;??
}??

3 Open基本實現

3.1 內核代碼實現

　　內核代碼的實現如下，其中指針的修飾符restrict是C99中的關鍵字，只用于限定指針。該關鍵字用于告知編譯器，所有修改該指針所指向內容的操作全部都是基于該指針的，即不存在其它進行修改操作的途徑；這樣的后果是幫助編譯器進行更好的代碼優化，生成更有效率的匯編代碼。

[cpp] view plaincopyprint?

/*?
?*?This?confidential?and?proprietary?software?may?be?used?only?as?
?*?authorised?by?a?licensing?agreement?from?ARM?Limited?
?*????(C)?COPYRIGHT?2013?ARM?Limited?
?*????????ALL?RIGHTS?RESERVED?
?*?The?entire?notice?above?must?be?reproduced?on?all?authorised?
?*?copies?and?copies?may?only?be?made?to?the?extent?permitted?
?*?by?a?licensing?agreement?from?ARM?Limited.?
?*/??
??
/**?
?*?\brief?Hello?World?kernel?function.?
?*?\param[in]?inputA?First?input?array.?
?*?\param[in]?inputB?Second?input?array.?
?*?\param[out]?output?Output?array.?
?*/??
/*?[OpenCL?Implementation]?*/??
__kernel?void?hello_world_opencl(__global?int*?restrict?inputA,??
?????????????????????????????????__global?int*?restrict?inputB,??
?????????????????????????????????__global?int*?restrict?output)??
{??
????/*?
?????*?Set?i?to?be?the?ID?of?the?kernel?instance.?
?????*?If?the?global?work?size?(set?by?clEnqueueNDRangeKernel)?is?n,?
?????*?then?n?kernels?will?be?run?and?i?will?be?in?the?range?[0,?n?-?1].?
?????*/??
????int?i?=?get_global_id(0);??
??
????/*?Use?i?as?an?index?into?the?three?arrays.?*/??
????output[i]?=?inputA[i]?+?inputB[i];??
}??
/*?[OpenCL?Implementation]?*/??

3.2 宿主機代碼實現

　　內核代碼中并沒有循環語句，只計算一個矩陣元素的值，每一個實例獲得一個獨一無二的所以需要運行的內核實例數目等同于矩陣元素個數。

[cpp] view plaincopyprint?

/*?
????*?Each?instance?of?our?OpenCL?kernel?operates?on?a?single?element?of?each?array?so?the?number?of?
????*?instances?needed?is?the?number?of?elements?in?the?array.?
????*/??
???size_t?globalWorksize[1]?=?{arraySize};??
???/*?Enqueue?the?kernel?*/??
???if?(!checkSuccess(clEnqueueNDRangeKernel(commandQueue,?kernel,?1,?NULL,?globalWorksize,?NULL,?0,?NULL,?&event)))??
???{??
???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???????cerr?<<?"Failed?enqueuing?the?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???????return?1;??
???}??

　　因為我們并沒有設置內核間的依賴性，OpenCL設備可以用并行的方式自由地運行內核實例。現在并行化上的唯一限制是設備的容量。在前面的代碼運行之前，需要建立OpenCL，下面分別介紹與建立OpenCL相關的各項內容。

　　因為現在的操作是在GPU而不是CPU中，我們需要知道任何使用數據的位置。知道數據是在GPU內存空間還是CPU內存空間是非常重要的。在桌面系統中，GPU和CPU有它們自己的內存空間，被相對低速率的總線分開，這意味著在GPU和CPU之間共享數據是一個代價高昂的操作。在大多數帶Mali-T600系列GPU的嵌入式系統中，GPU和CPU共享同一個內存，因此這使得以相對低的代價共享GPU和CPU之間內存成為可能。

　　由于這些系統的差異，OpenCL支持多種分配和共享設備間內存的方式。下面是一種共享設備間內存的方式，目的是減少從一個設備到另一個設備的內存拷貝(在一個共享內存系統中)。

a. 要求OpenCL設備分配內存

　　在C/C++實現中，我們使用數組來分配內存。

[cpp] view plaincopyprint?

/*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
int?arraySize?=?1000000;??
/*?Arrays?to?hold?the?input?and?output?data.?*/??
int*?inputA?=?new?int[arraySize];??
int*?inputB?=?new?int[arraySize];??
int*?output?=?new?int[arraySize];??

　　在OpenCL中，我們使用內存緩沖區。內存緩沖區其實是一定大小的內存塊。為了分配緩沖區，我們如下做：

[cpp] view plaincopyprint?

/*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
cl_int?arraySize?=?1000000;??
/*?The?buffers?are?the?size?of?the?arrays.?*/??
size_t?bufferSize?=?arraySize?*?sizeof(cl_int);??
/*?
?*?Ask?the?OpenCL?implementation?to?allocate?buffers?for?the?data.?
?*?We?ask?the?OpenCL?implemenation?to?allocate?memory?rather?than?allocating?
?*?it?on?the?CPU?to?avoid?having?to?copy?the?data?later.?
?*?The?read/write?flags?relate?to?accesses?to?the?memory?from?within?the?kernel.?
?*/??
bool?createMemoryObjectsSuccess?=?true;??
memoryObjects[0]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
memoryObjects[1]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
memoryObjects[2]?=?clCreateBuffer(context,?CL_MEM_WRITE_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
if?(!createMemoryObjectsSuccess)??
{??
????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????cerr?<<?"Failed?to?create?OpenCL?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????return?1;??
}??

　　盡管這看上去更加復雜，但其實這里只有三個OpenCL API調用。唯一的區別是這里我們檢查錯誤(這是一個好的做法)，而C++中并不用做。

b. 映射內存到局部指針

　　現在內存已分配，但是只有OpenCL實現知道它的位置。為了訪問CPU上的內存，我們把它們映射到一個指針。

[cpp] view plaincopyprint?

/*?Map?the?memory?buffers?created?by?the?OpenCL?implementation?to?pointers?so?we?can?access?them?on?the?CPU.?*/??
bool?mapMemoryObjectsSuccess?=?true;??
cl_int*?inputA?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[0],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
cl_int*?inputB?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[1],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
if?(!mapMemoryObjectsSuccess)??
{??
???cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???cerr?<<?"Failed?to?map?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???return?1;??
}??

　　現在這些指針可以想普通的C/C++指針那樣使用了。

c. 在CPU上初始化數據

　　因為我們已有了指向內存的指針，這一步與在CPU上一樣。

[cpp] view plaincopyprint?

for?(int?i?=?0;?i?<?arraySize;?i++)??
{??
???inputA[i]?=?i;??
???inputB[i]?=?i;??
}??

d. 取消映射緩沖區

　　為了使OpenCL設備使用緩沖區，我們必須把它們在CPU上的映射取消。

[cpp] view plaincopyprint?

/*?
?*?Unmap?the?memory?objects?as?we?have?finished?using?them?from?the?CPU?side.?
?*?We?unmap?the?memory?because?otherwise:?
?*?-?reads?and?writes?to?that?memory?from?inside?a?kernel?on?the?OpenCL?side?are?undefined.?
?*?-?the?OpenCL?implementation?cannot?free?the?memory?when?it?is?finished.?
?*/??
if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[0],?inputA,?0,?NULL,?NULL)))??
{??
???cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???return?1;??
}??
if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[1],?inputB,?0,?NULL,?NULL)))??
{??
???cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???return?1;??
}??

e. 映射數據到內核

　　在我們調度內核運行之前，我們必須告訴內核哪些數據作為輸入使用。這里，我們映射內存對象到OpenCL內核函數的參數中。

[cpp] view plaincopyprint?

bool?setKernelArgumentsSuccess?=?true;??
setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?0,?sizeof(cl_mem),?&memoryObjects[0]));??
setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?1,?sizeof(cl_mem),?&memoryObjects[1]));??
setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?2,?sizeof(cl_mem),?&memoryObjects[2]));??
if?(!setKernelArgumentsSuccess)??
{??
????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????cerr?<<?"Failed?setting?OpenCL?kernel?arguments.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????return?1;??
}??

f. 運行內核

　　對于內核代碼見前面，如何調度它則不作詳述。

g. 獲取運行結果

　　一旦計算結束，我們像映射輸入緩沖區那樣映射輸出緩沖區。然后，我們就可以使用指針讀取結果數據，然后取消緩沖區映射，就像前面那樣。

　　基本實現的宿主機的完整代碼如下：

[cpp] view plaincopyprint?

/*?
?*?This?confidential?and?proprietary?software?may?be?used?only?as?
?*?authorised?by?a?licensing?agreement?from?ARM?Limited?
?*????(C)?COPYRIGHT?2013?ARM?Limited?
?*????????ALL?RIGHTS?RESERVED?
?*?The?entire?notice?above?must?be?reproduced?on?all?authorised?
?*?copies?and?copies?may?only?be?made?to?the?extent?permitted?
?*?by?a?licensing?agreement?from?ARM?Limited.?
?*/??
??
#include?"common.h"??
#include?"image.h"??
??
#include?<CL/cl.h>??
#include?<iostream>??
??
using?namespace?std;??
??
/**?
?*?\brief?Basic?integer?array?addition?implemented?in?OpenCL.?
?*?\details?A?sample?which?shows?how?to?add?two?integer?arrays?and?store?the?result?in?a?third?array.?
?*??????????The?main?calculation?code?is?in?an?OpenCL?kernel?which?is?executed?on?a?GPU?device.?
?*?\return?The?exit?code?of?the?application,?non-zero?if?a?problem?occurred.?
?*/??
int?main(void)??
{??
????cl_context?context?=?0;??
????cl_command_queue?commandQueue?=?0;??
????cl_program?program?=?0;??
????cl_device_id?device?=?0;??
????cl_kernel?kernel?=?0;??
????int?numberOfMemoryObjects?=?3;??
????cl_mem?memoryObjects[3]?=?{0,?0,?0};??
????cl_int?errorNumber;??
??
????if?(!createContext(&context))??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?to?create?an?OpenCL?context.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
??
????if?(!createCommandQueue(context,?&commandQueue,?&device))??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?to?create?the?OpenCL?command?queue.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
??
????if?(!createProgram(context,?device,?"assets/hello_world_opencl.cl",?&program))??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?to?create?OpenCL?program."?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
??
????kernel?=?clCreateKernel(program,?"hello_world_opencl",?&errorNumber);??
????if?(!checkSuccess(errorNumber))??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?to?create?OpenCL?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
??
????/*?[Setup?memory]?*/??
????/*?Number?of?elements?in?the?arrays?of?input?and?output?data.?*/??
????cl_int?arraySize?=?1000000;??
??
????/*?The?buffers?are?the?size?of?the?arrays.?*/??
????size_t?bufferSize?=?arraySize?*?sizeof(cl_int);??
??
????/*?
?????*?Ask?the?OpenCL?implementation?to?allocate?buffers?for?the?data.?
?????*?We?ask?the?OpenCL?implemenation?to?allocate?memory?rather?than?allocating?
?????*?it?on?the?CPU?to?avoid?having?to?copy?the?data?later.?
?????*?The?read/write?flags?relate?to?accesses?to?the?memory?from?within?the?kernel.?
?????*/??
????bool?createMemoryObjectsSuccess?=?true;??
??
????memoryObjects[0]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
????createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
??
????memoryObjects[1]?=?clCreateBuffer(context,?CL_MEM_READ_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
????createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
??
????memoryObjects[2]?=?clCreateBuffer(context,?CL_MEM_WRITE_ONLY?|?CL_MEM_ALLOC_HOST_PTR,?bufferSize,?NULL,?&errorNumber);??
????createMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
??
????if?(!createMemoryObjectsSuccess)??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?to?create?OpenCL?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
????/*?[Setup?memory]?*/??
??
????/*?[Map?the?buffers?to?pointers]?*/??
????/*?Map?the?memory?buffers?created?by?the?OpenCL?implementation?to?pointers?so?we?can?access?them?on?the?CPU.?*/??
????bool?mapMemoryObjectsSuccess?=?true;??
??
????cl_int*?inputA?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[0],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
????mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
??
????cl_int*?inputB?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[1],?CL_TRUE,?CL_MAP_WRITE,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
????mapMemoryObjectsSuccess?&=?checkSuccess(errorNumber);??
??
????if?(!mapMemoryObjectsSuccess)??
????{??
???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???????cerr?<<?"Failed?to?map?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???????return?1;??
????}??
????/*?[Map?the?buffers?to?pointers]?*/??
??
????/*?[Initialize?the?input?data]?*/??
????for?(int?i?=?0;?i?<?arraySize;?i++)??
????{??
???????inputA[i]?=?i;??
???????inputB[i]?=?i;??
????}??
????/*?[Initialize?the?input?data]?*/??
??
????/*?[Un-map?the?buffers]?*/??
????/*?
?????*?Unmap?the?memory?objects?as?we?have?finished?using?them?from?the?CPU?side.?
?????*?We?unmap?the?memory?because?otherwise:?
?????*?-?reads?and?writes?to?that?memory?from?inside?a?kernel?on?the?OpenCL?side?are?undefined.?
?????*?-?the?OpenCL?implementation?cannot?free?the?memory?when?it?is?finished.?
?????*/??
????if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[0],?inputA,?0,?NULL,?NULL)))??
????{??
???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???????cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???????return?1;??
????}??
??
????if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[1],?inputB,?0,?NULL,?NULL)))??
????{??
???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???????cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???????return?1;??
????}??
????/*?[Un-map?the?buffers]?*/??
??
????/*?[Set?the?kernel?arguments]?*/??
????bool?setKernelArgumentsSuccess?=?true;??
????setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?0,?sizeof(cl_mem),?&memoryObjects[0]));??
????setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?1,?sizeof(cl_mem),?&memoryObjects[1]));??
????setKernelArgumentsSuccess?&=?checkSuccess(clSetKernelArg(kernel,?2,?sizeof(cl_mem),?&memoryObjects[2]));??
??
????if?(!setKernelArgumentsSuccess)??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?setting?OpenCL?kernel?arguments.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
????/*?[Set?the?kernel?arguments]?*/??
??
????/*?An?event?to?associate?with?the?Kernel.?Allows?us?to?retrieve?profiling?information?later.?*/??
????cl_event?event?=?0;??
??
????/*?[Global?work?size]?*/??
????/*?
?????*?Each?instance?of?our?OpenCL?kernel?operates?on?a?single?element?of?each?array?so?the?number?of?
?????*?instances?needed?is?the?number?of?elements?in?the?array.?
?????*/??
????size_t?globalWorksize[1]?=?{arraySize};??
????/*?Enqueue?the?kernel?*/??
????if?(!checkSuccess(clEnqueueNDRangeKernel(commandQueue,?kernel,?1,?NULL,?globalWorksize,?NULL,?0,?NULL,?&event)))??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?enqueuing?the?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
????/*?[Global?work?size]?*/??
??
????/*?Wait?for?kernel?execution?completion.?*/??
????if?(!checkSuccess(clFinish(commandQueue)))??
????{??
????????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????????cerr?<<?"Failed?waiting?for?kernel?execution?to?finish.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????????return?1;??
????}??
??
????/*?Print?the?profiling?information?for?the?event.?*/??
????printProfilingInfo(event);??
????/*?Release?the?event?object.?*/??
????if?(!checkSuccess(clReleaseEvent(event)))??
????{??
???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???????cerr?<<?"Failed?releasing?the?event?object.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???????return?1;??
????}??
??
????/*?Get?a?pointer?to?the?output?data.?*/??
????cl_int*?output?=?(cl_int*)clEnqueueMapBuffer(commandQueue,?memoryObjects[2],?CL_TRUE,?CL_MAP_READ,?0,?bufferSize,?0,?NULL,?NULL,?&errorNumber);??
????if?(!checkSuccess(errorNumber))??
????{??
???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???????cerr?<<?"Failed?to?map?buffer.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???????return?1;??
????}??
??
????/*?[Output?the?results]?*/??
????/*?Uncomment?the?following?block?to?print?results.?*/??
????/*?
????for?(int?i?=?0;?i?<?arraySize;?i++)?
????{?
????????cout?<<?"i?=?"?<<?i?<<?",?output?=?"?<<??output[i]?<<?"\n";?
????}?
????*/??
????/*?[Output?the?results]?*/??
??
????/*?Unmap?the?memory?object?as?we?are?finished?using?them?from?the?CPU?side.?*/??
????if?(!checkSuccess(clEnqueueUnmapMemObject(commandQueue,?memoryObjects[2],?output,?0,?NULL,?NULL)))??
????{??
???????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
???????cerr?<<?"Unmapping?memory?objects?failed?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
???????return?1;??
????}??
??
????/*?Release?OpenCL?objects.?*/??
????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
}??

4 向量化你的OpenCL代碼

4.1 向量基礎

　　OpenCL設備可以通告它們為不同數據類型的首選向量寬度，你可以使用這個信息來選擇一個內核。結果是，相當于該內核為你正在運行的平臺做了優化。例如，一個設備可能僅有標量整數的硬件支持，而另一個設備則有寬度為4的整數向量的硬件支持。可以寫兩個版本的內核，一個用于標量，一個用于向量，在運行時選擇正確的版本。

　　這里是一個在特定設備上詢問首選整數向量寬度的例子。

[cpp] view plaincopyprint?

/*?
?*?Query?the?device?to?find?out?it's?prefered?integer?vector?width.?
?*?Although?we?are?only?printing?the?value?here,?it?can?be?used?to?select?between?
?*?different?versions?of?a?kernel.?
?*/??
cl_uint?integerVectorWidth;??
clGetDeviceInfo(device,?CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,?sizeof(cl_uint),?&integerVectorWidth,?NULL);??
cout?<<?"Prefered?vector?width?for?integers:?"?<<?integerVectorWidth?<<?endl;??

對于其它OpenCL數據類型也是一樣的。

　　每一個Mali T600系列GPU核最少有兩個128位寬度的ALU(算數邏輯單元)，它們具有矢量計算能力。ALU中的絕大多數操作(例如，浮點加，浮點乘，整數加，整數乘)，可以以128位向量數據操作(例如，char16, short8, int4, float4)。使用前面講述的詢問方法來為你的數據類型決定使用正確的向量大小。

　　當使用Mali T600系列GPU時，我們推薦在任何可能的地方使用向量。

4.2 向量化代碼

　　首先，修改內核代碼以支持向量運算。對于Mali T600系列GPU來說，一個向量運算的時間與一個整數加法的時間是一樣的。具體代碼解讀，見下面代碼中的注釋部分。

[cpp] view plaincopyprint?

__kernel?void?hello_world_vector(__global?int*?restrict?inputA,??
?????????????????????????????????__global?int*?restrict?inputB,??
?????????????????????????????????__global?int*?restrict?output)??
{??
????/*?
?????*?We?have?reduced?the?global?work?size?(n)?by?a?factor?of?4?compared?to?the?hello_world_opencl?sample.?
?????*?Therefore,?i?will?now?be?in?the?range?[0,?(n?/?4)?-?1].?
?????*/??
????int?i?=?get_global_id(0);??
????/*?
?????*?Load?4?integers?into?'a'.?
?????*?The?offset?calculation?is?implicit?from?the?size?of?the?vector?load.?
?????*?For?vloadN(i,?p),?the?address?of?the?first?data?loaded?would?be?p?+?i?*?N.?
?????*?Load?from?the?data?from?the?address:?inputA?+?i?*?4.?
?????*/??
????int4?a?=?vload4(i,?inputA);??
????/*?Do?the?same?for?inputB?*/??
????int4?b?=?vload4(i,?inputB);??
????/*?
?????*?Do?the?vector?addition.?
?????*?Store?the?result?at?the?address:?output?+?i?*?4.?
?????*/??
????vstore4(a?+?b,?i,?output);??
}??

　　由于現在每個內核實例能夠實現多個加法運算，所以必須減少內核實例的數量，在宿主機代碼中的修改部分如下所示。

[cpp] view plaincopyprint?

/*?
?*?Each?instance?of?our?OpenCL?kernel?now?operates?on?4?elements?of?each?array?so?the?number?of?
?*?instances?needed?is?the?number?of?elements?in?the?array?divided?by?4.?
?*/??
size_t?globalWorksize[1]?=?{arraySize?/?4};??
/*?Enqueue?the?kernel?*/??
if?(!checkSuccess(clEnqueueNDRangeKernel(commandQueue,?kernel,?1,?NULL,?globalWorksize,?NULL,?0,?NULL,?&event)))??
{??
????cleanUpOpenCL(context,?commandQueue,?program,?kernel,?memoryObjects,?numberOfMemoryObjects);??
????cerr?<<?"Failed?enqueuing?the?kernel.?"?<<?__FILE__?<<?":"<<?__LINE__?<<?endl;??
????return?1;??
}??