Intro OpenCL Tutorial

Benedict R. Gaster, AMD Architect, OpenCL™

OpenCL™ is a young technology, and, while a specification has been published (www.khronos.org/registry/cl/), there are currently few documents that provide a basic introduction with examples. This article helps make OpenCL™ easier to understand and implement.

Note that:

  • I work at AMD, and, as such, I will test all example code on our implementation for both Windows® and Linux®; however, my intention is to illustrate the use of OpenCL™ regardless of platform. All examples are written in pure OpenCL™ and should run equally well on any implementation.
  • I have done my best to provide examples that work out-of-the-box on non-AMD implementations of OpenCL™, but I will not be testing them on non-AMD implementations; therefore, it is possible that an example might not work as expected on such systems. If this is the case, please let me know via our OpenCL™ forum, and I will do my best to rectify the code and publish an update.

The following “Hello World” tutorial provides a simple introduction to OpenCL™. I hope to follow up this first tutorial with additional ones covering topics such as:

  • Using platform and device layers to build robust OpenCL™
  • Program compilation and kernel objects
  • Managing buffers
  • Kernel execution
  • Kernel programming – basics
  • Kernel programming – synchronization
  • Matrix multiply – a case study
  • Kernel programming – built-ins

The “Hello World” program in OpenCL™

Here are some notes of caution on how the OpenCL™ samples are written:

  • OpenCL™ specifies a host API that is defined to be compatible with C89 and does not make any mention of C++ or other programming language bindings. Currently, there are several efforts to develop bindings for other languages (see the links at the end of this article), and, specifically, there has been a strong push to develop C++ bindings. In this and subsequent tutorials, I use the C++ bindings exclusively and describe OpenCL™ in these terms. See the OpenCL™ 1.0 specification for the corresponding C API. Alternatively, you can view the source for the C++ bindings to see which underlying OpenCL™ function a particular C++ binding uses, and with what arguments.
  • OpenCL™ defines a C-like language for programming compute device programs. These programs are passed to the OpenCL™ runtime via API calls expecting values of type char *. Often, it is convenient to keep these programs in separate source files. For this and subsequent tutorials, I assume the device programs are stored in files with names of the form name_kernels.cl, where name varies, depending on the context, but the suffix _kernels.cl does not. The corresponding device programs are loaded at runtime and passed to the OpenCL™ API. There are many alternative approaches to this; this one is chosen for readability.

For this first OpenCL™ program, we start with the source for the host application.

Header files

Just like any other external API used in C++, you must include a header file when using the OpenCL™ API. Usually, this is in the directory CL within the primary include directory. For the C++ bindings we have the following (for the plain C API, include cl.h instead):

    #include <utility>
    #define __NO_STD_VECTOR // Use cl::vector instead of STL version
    #include <CL/cl.hpp>

For our program, we use a small number of additional C++ headers, which are agnostic to OpenCL™.

    #include <cstdio>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <iterator>

As we will ask an OpenCL™ device to compute and return the “Hello World\n” string, we define it as a constant on the host so that its length can be used in later size calculations.

    const std::string hw("Hello World\n");

Errors

A common property of most OpenCL™ API calls is that they either return an error code (type cl_int) as the result of the function itself, or they store the error code at a location passed by the user as a parameter to the call. As with any API call that can fail, it is important for the application to behave correctly in the case of error. For the most part we will not concern ourselves with recovering from an error; for simplicity, we define a function, checkErr, that checks whether a given call has completed successfully, in which case OpenCL™ returns the value CL_SUCCESS. If the error code is not CL_SUCCESS, checkErr outputs a user message and exits; otherwise, it simply returns.

    inline void
    checkErr(cl_int err, const char * name)
    {
        if (err != CL_SUCCESS) {
            std::cerr << "ERROR: " << name
                      << " (" << err << ")" << std::endl;
            exit(EXIT_FAILURE);
        }
    }
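checkErr prints only the numeric error code. As a minimal sketch of how that number could be made more readable, the helper below maps a handful of codes to their symbolic names. The names errorName and describeError are hypothetical and not part of OpenCL™; the numeric values are repeated from the cl.h header as plain integers so that the sketch compiles without the OpenCL™ headers, and only a small subset of codes is covered.

```cpp
#include <string>

// Numeric values copied from the OpenCL 1.0 cl.h header; they are repeated
// here as plain ints so the sketch compiles without the OpenCL SDK.
// errorName is a hypothetical helper, not part of the OpenCL API.
inline std::string errorName(int err)
{
    switch (err) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    default:  return "unrecognised OpenCL error";
    }
}

// Builds a message in the style of checkErr, with the symbolic name
// in place of the raw number.
inline std::string describeError(int err, const char * name)
{
    return std::string("ERROR: ") + name + " (" + errorName(err) + ")";
}
```

Note that the host code in this tutorial passes -1 to checkErr as a generic failure value, which happens to coincide with CL_DEVICE_NOT_FOUND, so a name lookup like this one would report that name for those checks.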

A common paradigm for error handling in C++ is through the use of exceptions, and the OpenCL™ C++ bindings provide just such an interface. A later tutorial will cover the use of exceptions and other optional features provided by the C++ bindings. For now, let’s look at the one remaining function, main, necessary for our first OpenCL™ application.

OpenCL™ Contexts

The first step to initializing and using OpenCL™ is to create a context. The rest of the OpenCL™ work (creating devices and memory, compiling and running programs) is performed within this context. A context can have a number of associated devices (for example, CPU or GPU devices), and, within a context, OpenCL™ guarantees a relaxed memory consistency between devices. We will look at this in detail in a later tutorial; for now, we use a single device type, CL_DEVICE_TYPE_CPU, for the CPU device. We could have used CL_DEVICE_TYPE_GPU or some other supported device type, assuming that the OpenCL™ implementation supports that device. But before we can create a context we must first query the OpenCL™ runtime to determine which platforms, i.e. different vendors’ OpenCL™ implementations, are present. The class cl::Platform provides the static method cl::Platform::get for this, which returns a list of platforms. For now we select the first platform and use it to create a context. The constructor cl::Context should be successful and, in this case, the value of err is CL_SUCCESS.

    int
    main(void)
    {
        cl_int err;

        // Ask the runtime which platforms (vendor implementations) are present.
        cl::vector< cl::Platform > platformList;
        cl::Platform::get(&platformList);
        checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get");
        std::cerr << "Platform number is: " << platformList.size() << std::endl;

        std::string platformVendor;
        platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, &platformVendor);
        std::cerr << "Platform is by: " << platformVendor << "\n";

        // Create a context for the CPU device of the first platform.
        cl_context_properties cprops[3] =
            {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
        cl::Context context(
            CL_DEVICE_TYPE_CPU,
            cprops,
            NULL,
            NULL,
            &err);
        checkErr(err, "Context::Context()");

Before delving into compute devices, where the ‘real’ work happens, we first allocate an OpenCL™ buffer to hold the result of the kernel that will be run on the device, i.e. the string “Hello World\n.” For now we simply allocate some memory on the host and request that OpenCL™ use this memory directly, passing the flag CL_MEM_USE_HOST_PTR, when creating the buffer.

        // One byte per character plus one for the terminating '\0'.
        char * outH = new char[hw.length()+1];
        cl::Buffer outCL(
            context,
            CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
            hw.length()+1,
            outH,
            &err);
        checkErr(err, "Buffer::Buffer()");

OpenCL™ Devices

In OpenCL™ many operations are performed with respect to a given context. For example, the allocation of buffers (1D regions of memory) and images (2D and 3D regions of memory) are all context operations. But there are also device-specific operations. For example, program compilation and kernel execution are on a per-device basis, and for these a specific device handle is required. So how do we obtain a handle for a device? We simply query a context for it. OpenCL™ provides the ability to query information about particular objects, and using the C++ API it comes in the form of object.getInfo<CL_OBJECT_QUERY>(). In the specific case of getting the device from a context:

        cl::vector<cl::Device> devices;
        devices = context.getInfo<CL_CONTEXT_DEVICES>();
        checkErr(
            devices.size() > 0 ? CL_SUCCESS : -1, "devices.size() > 0");

Now that we have the list of associated devices for a context, in this case a single CPU device, we need to load and build the compute program (the program we intend to run on the device, or more generally, devices). The first few lines of the following code simply load the OpenCL™ device program from disk, convert it to a string, and create a cl::Program::Sources object using the helper constructor. Given an object of type cl::Program::Sources, a cl::Program object is created and associated with a context, then built for a particular set of devices.

        // Load the device program from disk into a string.
        std::ifstream file("lesson1_kernels.cl");
        checkErr(file.is_open() ? CL_SUCCESS:-1, "lesson1_kernels.cl");

        std::string prog(
            std::istreambuf_iterator<char>(file),
            (std::istreambuf_iterator<char>()));

        // Wrap the source text, create the program and build it for the devices.
        cl::Program::Sources source(1,
            std::make_pair(prog.c_str(), prog.length()+1));

        cl::Program program(context, source);
        err = program.build(devices,"");
        checkErr(err, "Program::build()");
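The file-loading step above does not depend on OpenCL™ at all, so it can be factored out and tried on its own. The following is a minimal sketch under that assumption; kernelFileName and readSource are hypothetical helper names introduced here for illustration, following the name_kernels.cl convention used in this tutorial.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: builds "<name>_kernels.cl" from a base name,
// following the naming convention used in this tutorial.
inline std::string kernelFileName(const std::string & name)
{
    return name + "_kernels.cl";
}

// Hypothetical helper: reads a whole text file into 'out'.
// Returns false if the file cannot be opened.
inline bool readSource(const std::string & path, std::string & out)
{
    std::ifstream file(path.c_str());
    if (!file.is_open()) {
        return false;
    }
    out.assign(std::istreambuf_iterator<char>(file),
               std::istreambuf_iterator<char>());
    return true;
}
```

With helpers like these, the host code could call readSource(kernelFileName("lesson1"), prog) and pass the boolean result to checkErr in the same way as the file.is_open() test above.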

A given program can have many entry points, called kernels, and to call one we must build a kernel object. There is assumed to exist a straightforward mapping from kernel names, represented as strings, to a function defined with the __kernel attribute in the compute program. In this case we can build a cl::Kernel object, kernel. Kernel arguments are set using the C++ API with kernel.setArg(), which takes the index and value for the particular argument.

        cl::Kernel kernel(program, "hello", &err);
        checkErr(err, "Kernel::Kernel()");

        err = kernel.setArg(0, outCL);
        checkErr(err, "Kernel::setArg()");

Now that the boilerplate code is done, it is time to compute the result (the output buffer with the string “Hello World\n”). All device computations are done using a command queue, which is a virtual interface for the device in question. Each command queue is associated with exactly one device; it is created with the associated context using a call to the constructor for the class cl::CommandQueue. Given a cl::CommandQueue queue, kernels can be queued using queue.enqueueNDRangeKernel. This queues a kernel for execution on the associated device.

The kernel can be executed on a 1D, 2D, or 3D domain of indexes that execute in parallel, given enough resources. The total number of elements (indexes) in the launch domain is called the global work size; individual elements are known as work-items. Work-items can be grouped into work-groups when communication between work-items is required. Work-groups are defined with a sub-index function (called the local work size), describing the size in each dimension corresponding to the dimensions specified for the global launch domain. There is a lot to consider with respect to kernel launches, and we will cover this in more detail in future tutorials. For now, it is enough to note that for Hello World, each work-item computes a letter in the resulting string, and it is enough to launch hw.length()+1 work-items, where hw is the const std::string we defined at the beginning of the program. We need the extra work-item to account for the NULL terminator.

        cl::CommandQueue queue(context, devices[0], 0, &err);
        checkErr(err, "CommandQueue::CommandQueue()");

        cl::Event event;
        err = queue.enqueueNDRangeKernel(
            kernel,
            cl::NullRange,               // no global offset
            cl::NDRange(hw.length()+1),  // global work size (1D)
            cl::NDRange(1),              // local work size (1D, to match the global range)
            NULL,
            &event);
        checkErr(err, "CommandQueue::enqueueNDRangeKernel()");
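To make the relationship between global work size, local work size, and work-groups concrete, here is a small host-only sketch for the 1D case. It is plain C++ arithmetic, not OpenCL™ API code; the names WorkItemIndex, numWorkGroups, and decompose are hypothetical, and the sketch assumes a zero global offset and a global size that is an exact multiple of the local size.

```cpp
#include <cstddef>

// Position of one work-item in a 1D launch domain.
// WorkItemIndex and decompose are hypothetical names for illustration.
struct WorkItemIndex {
    std::size_t group;  // which work-group the work-item belongs to
    std::size_t local;  // position inside that work-group
};

// Number of work-groups in a 1D launch; the global size is assumed to be
// an exact multiple of the local size.
inline std::size_t numWorkGroups(std::size_t globalSize, std::size_t localSize)
{
    return globalSize / localSize;
}

// Splits a global index into (work-group, index within the work-group),
// assuming a global offset of zero.
inline WorkItemIndex decompose(std::size_t globalId, std::size_t localSize)
{
    WorkItemIndex idx;
    idx.group = globalId / localSize;
    idx.local = globalId % localSize;
    return idx;
}
```

For the launch above, the global size is hw.length()+1 = 13 and the local size is 1, so there are 13 work-groups of one work-item each and every work-item has local index 0.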

The final argument to the enqueueNDRangeKernel call above was a cl::Event object, which can be used to query the status of the command with which it is associated (for example, whether it has completed). It supports the method wait(), which blocks until the command has completed. This is required to ensure the kernel has finished execution before reading the result back into host memory with queue.enqueueReadBuffer(). With the compute result back in host memory, it is simply a matter of outputting the result to stdout and exiting the program.

        event.wait();
        err = queue.enqueueReadBuffer(
            outCL,
            CL_TRUE,          // blocking read
            0,
            hw.length()+1,
            outH);
        checkErr(err, "CommandQueue::enqueueReadBuffer()");
        std::cout << outH;
        return EXIT_SUCCESS;
    }

Finally, to make the program complete, we need an implementation of the device program (lesson1_kernels.cl), which defines the external entry point, hello. The kernel implementation is straightforward: it calculates a unique index as a function of the launch domain using get_global_id(), uses it as an index into the string, hw, then writes the corresponding character to the output array, out.

    #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
    __constant char hw[] = "Hello World\n";
    __kernel void hello(__global char * out)
    {
        size_t tid = get_global_id(0);
        out[tid] = hw[tid];
    }

For robustness, it would make sense to check that the thread id (tid) is not out of range of hw; for now, we assume that the corresponding call to queue.enqueueNDRangeKernel() is correct.
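The range check mentioned above can be sketched on the host without an OpenCL™ device. The code below is an emulation in plain C++, not device code: helloWorkItem stands in for one work-item of the hello kernel with the check added, and runHello runs the work-items one after another where a device would run them in parallel. The names hwText, helloWorkItem, and runHello are hypothetical.

```cpp
#include <cstddef>
#include <string>

// The string the kernel copies, including its terminating '\0'.
static const char hwText[] = "Hello World\n";

// Host-side stand-in for one work-item of the "hello" kernel, with the
// range check added. 'tid' plays the role of get_global_id(0).
inline void helloWorkItem(std::size_t tid, char * out)
{
    if (tid < sizeof(hwText)) {
        out[tid] = hwText[tid];
    }
}

// Runs every work-item in turn; a device would run them in parallel.
// 'out' starts filled with '\0', so out-of-range work-items leave it unchanged.
inline std::string runHello(std::size_t globalSize)
{
    std::string out(globalSize, '\0');
    for (std::size_t tid = 0; tid < globalSize; ++tid) {
        helloWorkItem(tid, &out[0]);
    }
    return out;
}
```

Launching more work-items than hw has characters is then harmless in this emulation: the extra work-items read nothing from hwText and the output keeps its zero padding. In the device program, the equivalent change would be the same if test around the assignment to out[tid].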

Building and running

On Linux, a single command should be enough to build the OpenCL™ program; for example:
g++ -o hello_world -Ipath-OpenCL-include -Lpath-OpenCL-libdir lesson1.cpp -lOpenCL

To run:
LD_LIBRARY_PATH=path-OpenCL-libdir ./hello_world

On Windows, with a Visual Studio command window, an example is:
cl /Fehello_world.exe /Ipath-OpenCL-include lesson1.cpp path-OpenCL-libdir/OpenCL.lib

Assuming that OpenCL.dll is on the path, running
.\hello_world

outputs the following string on stdout:
Hello World

This completes our introductory tutorial to OpenCL™. Your feedback, comments, and questions are welcome. Please visit our OpenCL™ forum.

Useful Links

The following list provides links to some specific programming bindings, other than C, for OpenCL™. I have not tested these and cannot vouch for their correctness, but hope they will be useful:

  • OpenCL™ specification and headers:
    http://www.khronos.org/registry/cl/
  • OpenCL™ technical forum:
    http://www.khronos.org/message_boards/viewforum.php?f=28
  • The C++ bindings used in this tutorial can be found on the OpenCL™ web page at Khronos, along with complete documentation:
    http://www.khronos.org/registry/cl/
  • Python bindings can be found here:
    http://wiki.tiker.net/PyOpenCL
  • C# bindings can be found here:
    http://www.khronos.org/message_boards/viewtopic.php?f=28&t=1932

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
