Mali OpenCL SDK v1.1.0 Tutorial Sample 1: "Hello World"

1. Algorithm overview


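This sample adds two arrays element by element: Cn = An + Bn. The example is very simple, but because the elements are independent of one another, each one can easily be computed in parallel, which makes it an ideal fit for OpenCL.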



2. C/C++ implementation

  

    /*
     * This confidential and proprietary software may be used only as
     * authorised by a licensing agreement from ARM Limited
     *    (C) COPYRIGHT 2013 ARM Limited
     *        ALL RIGHTS RESERVED
     * The entire notice above must be reproduced on all authorised
     * copies and copies may only be made to the extent permitted
     * by a licensing agreement from ARM Limited.
     */

    #include <iostream>

    using namespace std;

    /**
     * \brief Basic integer array addition implemented in C/C++.
     * \details A sample which shows how to add two integer arrays and store the result in a third array.
     *          No OpenCL code is used in this sample, only standard C/C++. The code executes only on the CPU.
     * \return The exit code of the application, non-zero if a problem occurred.
     */
    int main(void)
    {
        /* [Setup memory] */
        /* Number of elements in the arrays of input and output data. */
        int arraySize = 1000000;

        /* Arrays to hold the input and output data. */
        int* inputA = new int[arraySize];
        int* inputB = new int[arraySize];
        int* output = new int[arraySize];
        /* [Setup memory] */

        /* Fill the arrays with data. */
        for (int i = 0; i < arraySize; i++)
        {
            inputA[i] = i;
            inputB[i] = i;
        }

        /* [C/C++ Implementation] */
        for (int i = 0; i < arraySize; i++)
        {
            output[i] = inputA[i] + inputB[i];
        }
        /* [C/C++ Implementation] */

        /* Uncomment the following block to print results. */
        /*
        for (int i = 0; i < arraySize; i++)
        {
            cout << "i = " << i << ", output = " << output[i] << "\n";
        }
        */

        delete[] inputA;
        delete[] inputB;
        delete[] output;
    }


3. Basic OpenCL implementation


3.1 Kernel code


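The kernel is shown below. The pointer qualifier restrict is a C99 keyword that applies only to pointers. It tells the compiler that every modification of the data the pointer refers to goes through that pointer, so there is no other path by which the data can change. This helps the compiler optimise better and generate more efficient machine code.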

    /*
     * This confidential and proprietary software may be used only as
     * authorised by a licensing agreement from ARM Limited
     *    (C) COPYRIGHT 2013 ARM Limited
     *        ALL RIGHTS RESERVED
     * The entire notice above must be reproduced on all authorised
     * copies and copies may only be made to the extent permitted
     * by a licensing agreement from ARM Limited.
     */

    /**
     * \brief Hello World kernel function.
     * \param[in] inputA First input array.
     * \param[in] inputB Second input array.
     * \param[out] output Output array.
     */
    /* [OpenCL Implementation] */
    __kernel void hello_world_opencl(__global int* restrict inputA,
                                     __global int* restrict inputB,
                                     __global int* restrict output)
    {
        /*
         * Set i to be the ID of the kernel instance.
         * If the global work size (set by clEnqueueNDRangeKernel) is n,
         * then n kernels will be run and i will be in the range [0, n - 1].
         */
        int i = get_global_id(0);

        /* Use i as an index into the three arrays. */
        output[i] = inputA[i] + inputB[i];
    }
    /* [OpenCL Implementation] */
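
To make the no-aliasing promise concrete, here is a minimal CPU-side sketch in C++. Standard C++ has no restrict keyword, so the sketch assumes the `__restrict` extension accepted by GCC, Clang and MSVC. The function name `addArrays` is invented for illustration and is not part of the SDK.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: __restrict is a compiler extension (GCC, Clang, MSVC), not standard C++.
// It promises that, inside this function, the output array does not overlap
// either input array, so the compiler may reorder or vectorise the loads and stores.
void addArrays(const int* __restrict inputA,
               const int* __restrict inputB,
               int* __restrict output,
               std::size_t count)
{
    // The caller has to keep the promise; this only catches the most obvious violation.
    assert(output != inputA && output != inputB);

    for (std::size_t i = 0; i < count; i++)
    {
        output[i] = inputA[i] + inputB[i];
    }
}
```

Calling `addArrays(a, b, out, n)` with three separate arrays is fine. Passing the same array as both an input and the output would break the promise, and the result would then be undefined.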

3.2 Host code


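The kernel contains no loop: it computes the value of a single element. Each instance receives a unique ID, so the number of kernel instances to run equals the number of array elements.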

    /*
     * Each instance of our OpenCL kernel operates on a single element of each array so the number of
     * instances needed is the number of elements in the array.
     */
    size_t globalWorksize[1] = {arraySize};
    /* Enqueue the kernel */
    if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed enqueuing the kernel. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }
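
What a one-dimensional global work size means can be pictured with a plain C++ emulation. This is only a sketch: the names `runNDRange` and `helloWorldInstance` are invented here, and the loop runs the instances one after another on the CPU, whereas a real OpenCL device is free to run them in any order or in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one kernel instance: 'globalId' plays the role of get_global_id(0).
void helloWorldInstance(std::size_t globalId,
                        const std::vector<int>& inputA,
                        const std::vector<int>& inputB,
                        std::vector<int>& output)
{
    output[globalId] = inputA[globalId] + inputB[globalId];
}

// Hypothetical emulation of a 1-D NDRange: one instance per ID in [0, globalWorkSize - 1].
void runNDRange(std::size_t globalWorkSize,
                const std::vector<int>& inputA,
                const std::vector<int>& inputB,
                std::vector<int>& output)
{
    // Every ID must be a valid index into all three arrays.
    assert(inputA.size() >= globalWorkSize);
    assert(inputB.size() >= globalWorkSize);
    assert(output.size() >= globalWorkSize);

    for (std::size_t id = 0; id < globalWorkSize; id++)
    {
        helloWorldInstance(id, inputA, inputB, output);
    }
}
```

Because no instance reads what another instance writes, the order of the loop iterations does not matter, which is exactly what lets a device execute them concurrently.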

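Because we have not set up any dependencies between kernel instances, the OpenCL device is free to run them in parallel. The only limit on parallelism is now the capacity of the device. Before the code above can run, OpenCL has to be set up; the steps involved are described below.


Because the work now happens on the GPU rather than the CPU, we need to know where every piece of data lives. It matters whether the data is in the GPU's memory space or the CPU's. On desktop systems the GPU and CPU have their own memory spaces, separated by a relatively slow bus, which makes sharing data between them an expensive operation. In most embedded systems with a Mali-T600 series GPU, the GPU and CPU share the same memory, so data can be shared between them at relatively low cost.


Because of these system differences, OpenCL supports several ways of allocating and sharing memory between devices. The approach below is one of them; its aim is to reduce memory copies from one device to another (on a shared-memory system).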


a. Ask the OpenCL device to allocate memory


In the C/C++ implementation we allocate the memory as arrays.

    /* Number of elements in the arrays of input and output data. */
    int arraySize = 1000000;
    /* Arrays to hold the input and output data. */
    int* inputA = new int[arraySize];
    int* inputB = new int[arraySize];
    int* output = new int[arraySize];

In OpenCL we use memory buffers. A memory buffer is simply a block of memory of a given size. To allocate the buffers, we do the following:

    /* Number of elements in the arrays of input and output data. */
    cl_int arraySize = 1000000;
    /* The buffers are the size of the arrays. */
    size_t bufferSize = arraySize * sizeof(cl_int);
    /*
     * Ask the OpenCL implementation to allocate buffers for the data.
     * We ask the OpenCL implementation to allocate memory rather than allocating
     * it on the CPU to avoid having to copy the data later.
     * The read/write flags relate to accesses to the memory from within the kernel.
     */
    bool createMemoryObjectsSuccess = true;
    memoryObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);
    memoryObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);
    memoryObjects[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);
    if (!createMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

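Although this looks more complicated, there are only three OpenCL API calls here. The only real difference is that we check for errors (which is good practice), something the C++ version does not do.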
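
The error handling in the buffer-creation code folds the status of all three calls into one flag with `&=` and reports a failure once. `checkSuccess` lives in the SDK's `common.h`, which is not reproduced in this article, so the sketch below uses an invented stand-in (`checkSuccessSketch`) and assumes only that a status of 0 means success, as `CL_SUCCESS` does in OpenCL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the SDK's checkSuccess(); assumes 0 means success (CL_SUCCESS is 0).
bool checkSuccessSketch(int errorNumber)
{
    return errorNumber == 0;
}

// Folds several status codes into one flag, the way the sample does with '&='.
// Unlike '&&', '&=' does not short-circuit, so every status code is examined.
bool allSucceeded(const int* statusCodes, std::size_t count)
{
    assert(statusCodes != nullptr || count == 0);

    bool success = true;
    for (std::size_t i = 0; i < count; i++)
    {
        success &= checkSuccessSketch(statusCodes[i]);
    }
    return success;
}
```

The design choice here is to attempt all three allocations and clean up once, rather than repeating the clean-up block after each call.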

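b. Map the memory to local pointers


The memory is now allocated, but only the OpenCL implementation knows where it is. To access it from the CPU, we map each buffer to a pointer.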

    /* Map the memory buffers created by the OpenCL implementation to pointers so we can access them on the CPU. */
    bool mapMemoryObjectsSuccess = true;
    cl_int* inputA = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[0], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    cl_int* inputB = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[1], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    if (!mapMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to map buffer. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

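These pointers can now be used like ordinary C/C++ pointers.


c. Initialise the data on the CPU


Because we now have pointers to the memory, this step is the same as in the C/C++ version.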

    for (int i = 0; i < arraySize; i++)
    {
        inputA[i] = i;
        inputB[i] = i;
    }

d. Unmap the buffers


Before the OpenCL device can use the buffers, we must unmap them from the CPU.

    /*
     * Unmap the memory objects as we have finished using them from the CPU side.
     * We unmap the memory because otherwise:
     * - reads and writes to that memory from inside a kernel on the OpenCL side are undefined.
     * - the OpenCL implementation cannot free the memory when it is finished.
     */
    if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[0], inputA, 0, NULL, NULL)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Unmapping memory objects failed " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }
    if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[1], inputB, 0, NULL, NULL)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Unmapping memory objects failed " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

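e. Map the data to the kernel


Before we schedule the kernel to run, we must tell it which data to work on. Here we bind the memory objects to the arguments of the OpenCL kernel function.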

    bool setKernelArgumentsSuccess = true;
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 0, sizeof(cl_mem), &memoryObjects[0]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 1, sizeof(cl_mem), &memoryObjects[1]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 2, sizeof(cl_mem), &memoryObjects[2]));
    if (!setKernelArgumentsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

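f. Run the kernel


The kernel code and the call that enqueues it were both shown above, so they are not repeated here.


g. Retrieve the results


Once the computation has finished, we map the output buffer just as we mapped the input buffers. We can then read the results through the pointer and unmap the buffer afterwards, as before.


The complete host code for the basic implementation is as follows: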

    /*
     * This confidential and proprietary software may be used only as
     * authorised by a licensing agreement from ARM Limited
     *    (C) COPYRIGHT 2013 ARM Limited
     *        ALL RIGHTS RESERVED
     * The entire notice above must be reproduced on all authorised
     * copies and copies may only be made to the extent permitted
     * by a licensing agreement from ARM Limited.
     */

    #include "common.h"
    #include "image.h"

    #include <CL/cl.h>
    #include <iostream>

    using namespace std;

    /**
     * \brief Basic integer array addition implemented in OpenCL.
     * \details A sample which shows how to add two integer arrays and store the result in a third array.
     *          The main calculation code is in an OpenCL kernel which is executed on a GPU device.
     * \return The exit code of the application, non-zero if a problem occurred.
     */
    int main(void)
    {
        cl_context context = 0;
        cl_command_queue commandQueue = 0;
        cl_program program = 0;
        cl_device_id device = 0;
        cl_kernel kernel = 0;
        int numberOfMemoryObjects = 3;
        cl_mem memoryObjects[3] = {0, 0, 0};
        cl_int errorNumber;

        if (!createContext(&context))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed to create an OpenCL context. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        if (!createCommandQueue(context, &commandQueue, &device))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed to create the OpenCL command queue. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        if (!createProgram(context, device, "assets/hello_world_opencl.cl", &program))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed to create OpenCL program." << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        kernel = clCreateKernel(program, "hello_world_opencl", &errorNumber);
        if (!checkSuccess(errorNumber))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed to create OpenCL kernel. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        /* [Setup memory] */
        /* Number of elements in the arrays of input and output data. */
        cl_int arraySize = 1000000;

        /* The buffers are the size of the arrays. */
        size_t bufferSize = arraySize * sizeof(cl_int);

        /*
         * Ask the OpenCL implementation to allocate buffers for the data.
         * We ask the OpenCL implementation to allocate memory rather than allocating
         * it on the CPU to avoid having to copy the data later.
         * The read/write flags relate to accesses to the memory from within the kernel.
         */
        bool createMemoryObjectsSuccess = true;

        memoryObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
        createMemoryObjectsSuccess &= checkSuccess(errorNumber);

        memoryObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
        createMemoryObjectsSuccess &= checkSuccess(errorNumber);

        memoryObjects[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
        createMemoryObjectsSuccess &= checkSuccess(errorNumber);

        if (!createMemoryObjectsSuccess)
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }
        /* [Setup memory] */

        /* [Map the buffers to pointers] */
        /* Map the memory buffers created by the OpenCL implementation to pointers so we can access them on the CPU. */
        bool mapMemoryObjectsSuccess = true;

        cl_int* inputA = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[0], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
        mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

        cl_int* inputB = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[1], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
        mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

        if (!mapMemoryObjectsSuccess)
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed to map buffer. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }
        /* [Map the buffers to pointers] */

        /* [Initialize the input data] */
        for (int i = 0; i < arraySize; i++)
        {
            inputA[i] = i;
            inputB[i] = i;
        }
        /* [Initialize the input data] */

        /* [Un-map the buffers] */
        /*
         * Unmap the memory objects as we have finished using them from the CPU side.
         * We unmap the memory because otherwise:
         * - reads and writes to that memory from inside a kernel on the OpenCL side are undefined.
         * - the OpenCL implementation cannot free the memory when it is finished.
         */
        if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[0], inputA, 0, NULL, NULL)))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Unmapping memory objects failed " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[1], inputB, 0, NULL, NULL)))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Unmapping memory objects failed " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }
        /* [Un-map the buffers] */

        /* [Set the kernel arguments] */
        bool setKernelArgumentsSuccess = true;
        setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 0, sizeof(cl_mem), &memoryObjects[0]));
        setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 1, sizeof(cl_mem), &memoryObjects[1]));
        setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 2, sizeof(cl_mem), &memoryObjects[2]));

        if (!setKernelArgumentsSuccess)
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }
        /* [Set the kernel arguments] */

        /* An event to associate with the Kernel. Allows us to retrieve profiling information later. */
        cl_event event = 0;

        /* [Global work size] */
        /*
         * Each instance of our OpenCL kernel operates on a single element of each array so the number of
         * instances needed is the number of elements in the array.
         */
        size_t globalWorksize[1] = {arraySize};
        /* Enqueue the kernel */
        if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed enqueuing the kernel. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }
        /* [Global work size] */

        /* Wait for kernel execution completion. */
        if (!checkSuccess(clFinish(commandQueue)))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed waiting for kernel execution to finish. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        /* Print the profiling information for the event. */
        printProfilingInfo(event);
        /* Release the event object. */
        if (!checkSuccess(clReleaseEvent(event)))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed releasing the event object. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        /* Get a pointer to the output data. */
        cl_int* output = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[2], CL_TRUE, CL_MAP_READ, 0, bufferSize, 0, NULL, NULL, &errorNumber);
        if (!checkSuccess(errorNumber))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Failed to map buffer. " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        /* [Output the results] */
        /* Uncomment the following block to print results. */
        /*
        for (int i = 0; i < arraySize; i++)
        {
            cout << "i = " << i << ", output = " << output[i] << "\n";
        }
        */
        /* [Output the results] */

        /* Unmap the memory object as we are finished using them from the CPU side. */
        if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[2], output, 0, NULL, NULL)))
        {
            cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
            cerr << "Unmapping memory objects failed " << __FILE__ << ":" << __LINE__ << endl;
            return 1;
        }

        /* Release OpenCL objects. */
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
    }


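4. Vectorising your OpenCL code


4.1 Vector basics


OpenCL devices can advertise their preferred vector width for each data type, and you can use this information to choose a kernel. The result is a kernel that is effectively optimised for the platform you are running on. For example, one device might have hardware support only for scalar integers, while another has hardware support for integer vectors of width 4. You can write two versions of the kernel, one scalar and one vector, and select the right one at run time.

Here is an example of querying the preferred integer vector width on a particular device.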

    /*
     * Query the device to find out its preferred integer vector width.
     * Although we are only printing the value here, it can be used to select between
     * different versions of a kernel.
     */
    cl_uint integerVectorWidth;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, sizeof(cl_uint), &integerVectorWidth, NULL);
    cout << "Preferred vector width for integers: " << integerVectorWidth << endl;

The same applies to the other OpenCL data types.
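
One way to act on the queried width is to map it to a kernel name before calling `clCreateKernel`. The helper below is a hypothetical sketch, not SDK code: `selectKernelName` is an invented name, and the rule "use the vector kernel when the device prefers a width of at least 4" is an assumption chosen to match the int4 kernel of this tutorial. The two kernel names are the ones used by the scalar and vectorised samples.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: picks a kernel by the device's preferred integer vector width.
// 'preferredIntWidth' is the value queried with CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT.
std::string selectKernelName(unsigned int preferredIntWidth)
{
    // Assumption: a width of 0 is treated as an invalid query result for int.
    assert(preferredIntWidth >= 1);

    // Assumed rule: the int4 kernel is used when the device prefers vectors of 4 or more ints.
    if (preferredIntWidth >= 4)
    {
        return "hello_world_vector";
    }
    return "hello_world_opencl";
}
```

The returned string would then be passed as the kernel name to `clCreateKernel`, and the global work size would have to be chosen to match the selected kernel.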

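Each Mali T600 series GPU core has at least two 128-bit-wide ALUs (arithmetic logic units) with vector capability. Most ALU operations (for example floating-point add, floating-point multiply, integer add and integer multiply) can operate on 128-bit vector data (for example char16, short8, int4 or float4). Use the query described above to decide on the right vector size for your data type.

When using a Mali T600 series GPU, we recommend using vectors wherever possible.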


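4.2 Vectorising the code


First, modify the kernel to use vector operations. On a Mali T600 series GPU, a vector addition takes the same time as a single integer addition. The comments in the code below explain the details.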

    __kernel void hello_world_vector(__global int* restrict inputA,
                                     __global int* restrict inputB,
                                     __global int* restrict output)
    {
        /*
         * We have reduced the global work size (n) by a factor of 4 compared to the hello_world_opencl sample.
         * Therefore, i will now be in the range [0, (n / 4) - 1].
         */
        int i = get_global_id(0);
        /*
         * Load 4 integers into 'a'.
         * The offset calculation is implicit from the size of the vector load.
         * For vloadN(i, p), the address of the first data loaded would be p + i * N.
         * Load the data from the address: inputA + i * 4.
         */
        int4 a = vload4(i, inputA);
        /* Do the same for inputB */
        int4 b = vload4(i, inputB);
        /*
         * Do the vector addition.
         * Store the result at the address: output + i * 4.
         */
        vstore4(a + b, i, output);
    }

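
The implicit offset arithmetic of vload4 and vstore4 can be checked on the CPU with a small emulation. This is a sketch under assumptions: `vload4Sketch`, `vstore4Sketch` and `helloWorldVectorInstance` are invented stand-ins written in plain C++, and they mimic only the addressing rule (p + i * 4), not the hardware vector instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Int4 = std::array<int, 4>;

// CPU stand-in for vload4(i, p): reads the 4 ints starting at p + i * 4.
Int4 vload4Sketch(std::size_t i, const std::vector<int>& p)
{
    assert(i * 4 + 4 <= p.size());
    Int4 result{};
    for (std::size_t lane = 0; lane < 4; lane++)
    {
        result[lane] = p[i * 4 + lane];
    }
    return result;
}

// CPU stand-in for vstore4(v, i, p): writes 4 ints starting at p + i * 4.
void vstore4Sketch(const Int4& v, std::size_t i, std::vector<int>& p)
{
    assert(i * 4 + 4 <= p.size());
    for (std::size_t lane = 0; lane < 4; lane++)
    {
        p[i * 4 + lane] = v[lane];
    }
}

// One emulated instance of the vectorised kernel: handles elements [i * 4, i * 4 + 3].
void helloWorldVectorInstance(std::size_t i,
                              const std::vector<int>& inputA,
                              const std::vector<int>& inputB,
                              std::vector<int>& output)
{
    Int4 a = vload4Sketch(i, inputA);
    Int4 b = vload4Sketch(i, inputB);
    Int4 sum{};
    for (std::size_t lane = 0; lane < 4; lane++)
    {
        sum[lane] = a[lane] + b[lane];
    }
    vstore4Sketch(sum, i, output);
}
```

Running instance 1 on arrays of 8 elements touches only elements 4 to 7, which shows why a quarter as many instances cover the whole array.
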
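Because each kernel instance now performs several additions, the number of kernel instances must be reduced. The modified part of the host code is shown below.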

    /*
     * Each instance of our OpenCL kernel now operates on 4 elements of each array so the number of
     * instances needed is the number of elements in the array divided by 4.
     */
    size_t globalWorksize[1] = {arraySize / 4};
    /* Enqueue the kernel */
    if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed enqueuing the kernel. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

The reduction factor depends on the vector width. For example, if the kernel used int8 instead of int4, the reduction factor would be 8.
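
With arraySize = 1000000 the division by 4 is exact. For sizes that are not a multiple of the vector width, integer division would silently drop the last few elements. The following sketch (hypothetical names `WorkSplit` and `splitWork`, not part of the SDK sample) splits the work into a vectorised part and a scalar tail that would still have to be processed separately.

```cpp
#include <cassert>
#include <cstddef>

// Result of dividing an array between the vector kernel and a scalar tail.
struct WorkSplit
{
    std::size_t vectorWorkItems; // global work size for the vector kernel
    std::size_t tailElements;    // elements left over after the vector part
};

// Hypothetical helper: computes the reduced global work size for a given vector width.
WorkSplit splitWork(std::size_t arraySize, std::size_t vectorWidth)
{
    assert(vectorWidth > 0);

    WorkSplit split;
    split.vectorWorkItems = arraySize / vectorWidth;
    split.tailElements = arraySize % vectorWidth;
    return split;
}
```

For the sample's values, `splitWork(1000000, 4)` gives 250000 work items and no tail, matching the `arraySize / 4` used in the host code.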

   

5. Running the OpenCL sample


(1) At a command prompt in the SDK root directory:

    cd samples\hello_world_vector
    cs-make install

This builds the vectorised OpenCL hello world sample and copies all the files needed at run time into the bin folder under the SDK root directory.


(2) Copy the bin folder to the target board.


(3) On the board, navigate to that directory and run the hello world binary:

    chmod 777 hello_world_vector
    ./hello_world_vector
