CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It employs hierarchical decomposition and data-movement strategies similar to those used in cuBLAS and cuDNN.
The latest CUTLASS release is 3.3, which differs substantially from 1.3.3. Revisiting 1.3.3 is still worthwhile, however, because it is much easier to understand:
- it matches the material presented in PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS;
- it supports only one Tensor Core architecture, Volta;
- its Tensor Core path supports only one data type, half;
- it uses only one instruction, HMMA.884.F16.F16.
Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply observes that on the T4 GPU, GEMM, which used to be compute-bound, became IO-bound once Tensor Cores were introduced. Although the V100 has three times the bandwidth of the T4, it suffers from the same bandwidth shortage.
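A rough machine-balance calculation makes the point concrete. The peak numbers below are approximate public specs and only illustrative; the program simply divides FP16 Tensor Core throughput by DRAM bandwidth to show how many FLOPs must be done per byte of traffic before the Tensor Cores are saturated.

#include <cstdio>

int main() {
  // Approximate peak specs (hedged): FP16 Tensor Core throughput and DRAM bandwidth.
  const char* name[2]            = {"T4", "V100"};
  double peak_tensor_flops[2]    = {65e12, 125e12};  // ~65 and ~125 TFLOPS
  double peak_bandwidth[2]       = {300e9, 900e9};   // ~300 GB/s GDDR6, ~900 GB/s HBM2

  for (int i = 0; i < 2; ++i) {
    // FLOPs required per byte of DRAM traffic to stay compute-bound.
    double balance = peak_tensor_flops[i] / peak_bandwidth[i];
    std::printf("%s: ~%.0f FLOP per byte needed to saturate the Tensor Cores\n",
                name[i], balance);
  }
  return 0;
}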
Therefore, CUTLASS optimizes the data path as follows:
- 128-bit access granularity along the entire path: LDG.128, STS.128, LDS.128, and STG.128;
- conflict-free Shared Memory layout: no Shared Memory padding is needed for the transpose;
- software pipelining: the LDG.128, LDS.128, and HMMA.884.F16.F16 instructions execute concurrently to hide data movement (see the sketch below).
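The following is a schematic sketch of that software pipeline, not CUTLASS source: a classic double-buffered tiled GEMM in plain CUDA C++, with float FMAs standing in for HMMA.884.F16.F16 and with M, N, K assumed to be multiples of the tile size. The only point is to show how the global load for tile k+1 is issued before the math on tile k, so the LDG, LDS/STS, and MMA streams overlap.

#include <cuda_runtime.h>

constexpr int kTile = 32;  // illustrative tile edge; blockDim must be (kTile, kTile)

// Assumes M, N, K are multiples of kTile so no boundary predicates are needed.
__global__ void pipelined_sgemm_sketch(const float* A, const float* B, float* C,
                                       int M, int N, int K) {
  __shared__ float sa[2][kTile][kTile];  // double-buffered tile of A
  __shared__ float sb[2][kTile][kTile];  // double-buffered tile of B

  int row = blockIdx.y * kTile + threadIdx.y;
  int col = blockIdx.x * kTile + threadIdx.x;
  float acc = 0.f;

  // Prologue: stage the first tile into buffer 0 (the LDG -> STS leg).
  sa[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
  sb[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
  __syncthreads();

  for (int k0 = 0; k0 < K; k0 += kTile) {
    int buf = (k0 / kTile) & 1;

    // Issue the global loads for the *next* tile early; their latency overlaps
    // with the shared-memory loads and FMAs below.
    float next_a = 0.f, next_b = 0.f;
    if (k0 + kTile < K) {
      next_a = A[row * K + (k0 + kTile) + threadIdx.x];
      next_b = B[(k0 + kTile + threadIdx.y) * N + col];
    }

    // Math on the tile already resident in shared memory (the LDS -> FMA leg).
    #pragma unroll
    for (int kk = 0; kk < kTile; ++kk) {
      acc += sa[buf][threadIdx.y][kk] * sb[buf][kk][threadIdx.x];
    }

    // Commit the prefetched values into the other buffer for the next iteration.
    if (k0 + kTile < K) {
      sa[buf ^ 1][threadIdx.y][threadIdx.x] = next_a;
      sb[buf ^ 1][threadIdx.y][threadIdx.x] = next_b;
    }
    __syncthreads();
  }

  C[row * N + col] = acc;
}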
The following walks through the Volta884_h884gemm implementation, using one of the GEMM unit tests as the running example.
TEST(Volta884_h884gemm_128x64x32_nt, 520x264x136)
OutputTile is the threadblock tile, set to 32x64x128 in this test; the Shape lists the dimensions in reverse order relative to the test name, with the K extent (32) first, so this is the 128x64x32 tile named in the test. WarpGemmShape is 32x64x64, which is a fixed value.
run_gemm initializes Volta884GemmTraits::Params and a GemmTestbed, calls Gemm::launch, and then compares the result against a reference (a rough sketch of this flow follows the test case below).
TEST(Volta884_h884gemm_128x64x32_nt, 520x264x136) {

  typedef cutlass::gemm::Volta884GemmTraits<
    cutlass::MatrixLayout::kColumnMajor,
    cutlass::MatrixLayout::kRowMajor,
    cutlass::Shape<32, 64, 128>,
    cutlass::Shape<32, 64, 64>,
    half,
    half,
    half,
    2
  > GemmTraits;

  run_gemm<GemmTraits>(520, 264, 136);
}
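From memory of the CUTLASS 1.x examples, run_gemm boils down to roughly the following hedged sketch. The real test harness wraps allocation, initialization, and verification in GemmTestbed, and the exact Params::initialize argument order in 1.3.3 may differ slightly; the leading dimensions and scalar handling below are assumptions for illustration.

#include <cuda_fp16.h>
#include "cutlass/gemm/gemm.h"

template <typename GemmTraits>
void run_gemm_sketch(int m, int n, int k) {
  typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

  // Device buffers for A (m x k), B (k x n) and C/D (m x n); allocation,
  // initialization and the reference check are elided here.
  half *A = nullptr, *B = nullptr, *C = nullptr;
  int lda = m, ldb = n, ldc = m;          // leading dimensions for the "nt" layouts (assumed)
  half alpha = __float2half(1.0f);
  half beta  = __float2half(0.0f);

  // Kernel parameters: problem size, scalars and tensor references.
  typename Gemm::Params params;
  params.initialize(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc, /*D=*/C, ldc);

  // Launch, then (in the real test) compare D against the GemmTestbed reference.
  Gemm::launch(params);
}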
The hierarchy of the Volta884 implementation in CUTLASS is shown in the figure below.
gemm_kernel_nolb
The kernel function takes the dynamic Shared Memory, hands it to GemmMainloop, and then calls GemmMainloop::multiply_add to perform the computation.
/// GEMM kernel without launch bounds specified
template <typename Gemm_>
__global__ /* __launch_bounds__(Gemm_::kThreads) */
void gemm_kernel_nolb(typename Gemm_::Params params) {

  // Dynamic shared memory base pointer
  extern __shared__ int GemmSharedStorageBase[];

  // Declare pointer to dynamic shared memory.
  typename Gemm_::SharedStorage *shared_storage =
      reinterpret_cast<typename Gemm_::SharedStorage *>(GemmSharedStorageBase);

  // Construct the GEMM object.
  Gemm_ gemm(params, *shared_storage);

  // Run GEMM.
  gemm.multiply_add();
}
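For context, a kernel written this way is typically launched as in the sketch below (an assumption about what Gemm::launch does, not the 1.3.3 source): the dynamic shared-memory size is sizeof(SharedStorage), and an explicit opt-in via cudaFuncSetAttribute is required whenever it exceeds the default 48 KB limit.

#include <cuda_runtime.h>

// Hypothetical host-side launcher mirroring the kernel signature above.
template <typename Gemm_>
cudaError_t launch_gemm_sketch(typename Gemm_::Params const& params,
                               dim3 grid, cudaStream_t stream = 0) {
  int const smem_bytes = int(sizeof(typename Gemm_::SharedStorage));

  // Volta allows up to 96 KB of dynamic shared memory per block, but anything
  // beyond 48 KB needs an explicit opt-in.
  if (smem_bytes >= (48 << 10)) {
    cudaError_t err = cudaFuncSetAttribute(gemm_kernel_nolb<Gemm_>,
                                           cudaFuncAttributeMaxDynamicSharedMemorySize,
                                           smem_bytes);
    if (err != cudaSuccess) {
      return err;
    }
  }

  dim3 block(Gemm_::kThreads, 1, 1);
  gemm_kernel_nolb<Gemm_><<<grid, block, smem_bytes, stream>>>(params);
  return cudaGetLastError();
}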
GemmMainloop
GemmMainloop implements the software pipeline, as shown in the figure below:
Both Shared Memory and the registers need two buffers; scheduling on the SM keeps the three pipelines running concurrently. Loads from Global Memory to Shared Memory require synchronization, whereas moving data from Shared Memory into registers does not. Because architectures before Ampere cannot copy directly from Global Memory to Shared Memory, the whole staging process is fairly involved. As shown in the figure below, the code calls the Copy::transform function in several places to produce a transformed_fragment. The purpose is presumably type conversion, but since Volta supports only half, it has no practical effect here.
template <typename Traits_>
struct GemmMainloop {
  //
  // Type definitions
  //

  /// The traits.
  typedef Traits_ Traits;
  /// The GEMM mainloop
  typedef typename Traits::KernelClass KernelClass;
  /// The shared storage.
  typedef typename Traits::SharedStorage SharedStorage;
  /// The scalar for A.
  typedef typename Traits::ScalarA ScalarA;
  /// The scalar for B.
  typedef typename Traits::ScalarB ScalarB;
  /// The scalar in the epilogue.
  typedef typename Traits::Epilogue::Scalar ScalarEpilogue;
  /// The scalar for C.
  typedef typename Traits::Epilogue::ScalarC ScalarC;
  /// The scalar for D.
  typedef typename Traits::Epilogue::ScalarD ScalarD;
  /// The index.
  typedef typename Traits::Index Index;
  /// Define the mainloop iteration size
  typedef typename Traits::MultiplyAdd MultiplyAdd;
  /// The number of threads.
  static int const kThreads = Traits::GemmConfig::kThreads;
AccumulatorsPerWarp is GemmConfig::AccumulatorsPerWarp, i.e. Volta884MultiplyAdd::WarpGemmShape, which is 32x64x64. Volta884MultiplyAdd::InstructionShape is 4x32x32. kWarpGemmSteps is therefore 32 / 4 = 8.
  // Number of warp-level multiply-accumulate steps executed by each warp.
  static Index const kWarpGemmSteps =
      Traits::GemmConfig::AccumulatorsPerWarp::kD / MultiplyAdd::InstructionShape::kD;

  /*
  // Make sure we have at least 2 unrolling steps or our pipeling is not going to work.
  static_assert(kWarpGemmSteps >= 2, "The pipelining assumes at least two steps");
  */

  /// Use the params object defined in traits
  typedef typename Traits::Params Params;

  //
  // Data members
  //

  /// The params.
  Params const& params;

  /// SharedStorage object
  SharedStorage& shared_storage;
  //
  // Methods
  //

  /// Ctor.
  CUTLASS_DEVICE GemmMainloop(Params const& params_, SharedStorage& shared_storage_)
      : params(params_), shared_storage(shared_storage_) {}
GemmMainloop::fetch_global
Volta884GemmTraits::GlobalLoadStream is the GlobalLoadStreamPair type.
GlobalLoadStreamPair::residue calls MMAGlobalLoadStream::residue twice to compute the predicate masks needed for the final load of the threadblock tile.
GlobalLoadStreamPair::copy calls MMAGlobalLoadStream::copy twice to copy matrix elements from Global Memory into registers; the latter calls TileLoadIterator::load_post_increment.
  /// Fetches global stream pair
  template <bool Residue>
  CUTLASS_DEVICE void fetch_global(typename Traits::GlobalLoadStream& global_to_shared_stream,
                                   Index outer_k) {
    // If residue portion and not calculating residue in prolog, update residue predicates now.
    if (Residue) {
      global_to_shared_stream.residue(outer_k);
    }
    global_to_shared_stream.copy();
  }
GemmMainloop::consume_tile
If kWarpGemmSteps is less than or equal to 4, kGlobalStreamFirst is set and the data for the next iteration is loaded from Global Memory first.
  /// Computes a warp-level GEMM on data held in shared memory
  template <bool Residue, bool LastIteration>
  CUTLASS_DEVICE void consume_tile(typename Traits::GlobalLoadStream& global_to_shared_stream,
                                   typename Traits::SharedStream& shared_load_stream,
                                   typename MultiplyAdd::Accumulators& accumulators,
                                   Index outer_k) {
    // Whether to load global stream before loading shared stream
    const bool kGlobalStreamFirst = (kWarpGemmSteps <= 4);

    // Load data for the next iteration of the main loop (unless it's the last iteration).
    if (kGlobalStreamFirst && !LastIteration) {
      fetch_global<Residue>(global_to_shared_stream, outer_k);
    }
The loop first loads the next step's input from Shared Memory; the register fragments are double-buffered.
MMASharedLoadStream::copy calls Volta884WarpMultiplicandLoadIterator::load to bring the data into registers.
One open question: if GemmMainloop::fetch_global was not called in the previous step, is the copy from Shared Memory still safe?
    CUTLASS_PRAGMA_UNROLL
    for (int step = 0; step < kWarpGemmSteps; ++step) {
      // Trigger the copy from shared memory for the next A/B values.
      shared_load_stream.copy((step + 1) % kWarpGemmSteps);
If kGlobalStreamFirst is not set, GemmMainloop::fetch_global is called on the first step of the loop to load the input.
      // Load data for the next iteration of the main loop (unless it's the last iteration).
      if (!kGlobalStreamFirst && (step == 0) && !LastIteration) {
        fetch_global<Residue>(global_to_shared_stream, outer_k);
      }
On the second-to-last step, the data must be guaranteed to have arrived in Shared Memory.
Volta884GemmTraits::shared_load_fence decides whether to synchronize the threads based on the StageCount passed in from outside.
GlobalLoadStreamPair::commit calls GlobalLoadStream::commit for each of the two matrices to copy the data into Shared Memory.
Volta884GemmTraits::shared_store_fence synchronizes the threads.
MMASharedLoadStream::inc_stage advances stage_index (a sketch of what the two fences boil down to follows the code below).
      if (step == kWarpGemmSteps - 2) {
        // Make sure the data from shared memory has been entirely consumed.
        Traits::shared_load_fence(true);

        global_to_shared_stream.commit();

        // Make sure the data is in shared memory.
        Traits::shared_store_fence(true);

        // Move to the next stage for the load (if it makes sense).
        shared_load_stream.inc_stage();
      }
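What the two fences reduce to is, presumably, just conditional __syncthreads calls; the sketch below is an assumption based on the description above (the real Volta884GemmTraits definitions gate this on the traits' stage count), not the verbatim 1.3.3 code.

// Hypothetical stand-ins for Volta884GemmTraits::shared_load_fence / shared_store_fence.
template <int StageCount>
struct SharedFenceSketch {
  // Before overwriting a shared-memory stage: with only one stage the warps may
  // still be reading it, so the whole threadblock has to synchronize first.
  __device__ static void shared_load_fence(bool /*in_loop*/) {
    if (StageCount < 2) {
      __syncthreads();
    }
  }

  // After committing a new tile: it must be visible to every warp before use.
  __device__ static void shared_store_fence(bool /*in_loop*/) {
    __syncthreads();
  }
};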
MMASharedLoadStream::commit calls Copy to perform the copy; Volta884WarpMultiplicandLoadIterator::Fragment is simply a Fragment.
Volta884MultiplyAdd::multiply_add then performs the warp-tile computation.
      // Make sure the values are available for the current iteration to do the multiply-add.
      shared_load_stream.commit(step);

      // Do the math on the fragments of the current iteration.
      MultiplyAdd multiply_add;
      multiply_add.multiply_add(shared_load_stream.fragment_a(step),
                                shared_load_stream.fragment_b(step),
                                accumulators,
                                accumulators);
    }
  }
GemmMainloop::multiply_add
make_Coord_from_shape creates a Coord object from a shape.
IdentityBlockSwizzle::get_threadblock_offset returns the offset of the current threadblock within the 2D output.
Volta884GemmTraits::ClearAccumulators is ClearAccumulators.
IdentityBlockSwizzle::get_threadblock_bounds returns the 3D bounds of the threadblock.
  /// Do the GEMM.
  CUTLASS_DEVICE void multiply_add() {
    // Swizzle the IDs of the block (to enable better cache behavior).
    typename Traits::BlockSwizzle block_swizzle;
    Coord<3> threadblock_offset =
        block_swizzle.get_threadblock_offset(make_Coord_from_shape<typename Traits::OutputTile>());

    // We may want to use shared memory to clear the registers.
    typedef typename Traits::ClearAccumulators ClearAccumulators;

    // Get the bounds for each thread, it maybe different than problem_size
    Coord<3> bounds = block_swizzle.get_threadblock_bounds(params.problem_size,
                                                           params.partitionK_range);
params.global_to_shared_stream is a GlobalLoadStreamPair::Params.
shared_storage.main_loop.global_to_shared_stream is a GlobalLoadStreamPair::SharedStorage.
shared_storage.main_loop.threadblock_tile is a GlobalLoadStreamPair::ThreadblockTileStorage, i.e. a ZipTileAllocation; ZipTileAllocation::reference returns a ZipTensorRef object pointing at the data.
global_to_shared_stream is a Volta884GemmTraits::GlobalLoadStream, i.e. a GlobalLoadStreamPair.
GlobalLoadStreamPair::add_batch_offset calls GlobalLoadStream::add_batch_offset on each stream to set the iterators' batch offsets.
    // The streams to read A/B from global memory to shared memory.
    typename Traits::GlobalLoadStream global_to_shared_stream(
        params.global_to_shared_stream,
        shared_storage.main_loop.global_to_shared_stream,
        shared_storage.main_loop.threadblock_tile.reference(),
        bounds,
        threadblock_offset);

    // update A and B pointer offset based on batch_id and batch_stride_offset
    global_to_shared_stream.add_batch_offset(block_swizzle.get_batch_id());

    // Create the accumulator clear.
    ClearAccumulators clear;
GlobalLoadStreamPair::move_to_residue calls MMAGlobalLoadStream::move_to_residue to move the pointers when the residue is handled in the prologue; otherwise it directly calls GlobalLoadStreamPair::residue.
GlobalLoadStreamPair::copy calls MMAGlobalLoadStream::copy, which in turn calls TileLoadIterator::load_post_increment to load fragments of the A and B matrices into Fragment registers.
GlobalLoadStreamPair::commit calls MMAGlobalLoadStream::commit, which calls Copy::transform to perform the copy and then
Volta884ThreadblockMultiplicandStoreIterator::store_post_increment to store the data into Shared Memory.
Volta884GemmTraits::shared_store_fence synchronizes the threads within the threadblock.
GlobalLoadStreamPair::rollback calls MMAGlobalLoadStream::rollback, which calls TileLoadIterator::initialize_predicates to initialize the predicate vector and then moves the offset back; for this test (K = 136, OutputTile::kD = 32, assuming no partition-K split) the rollback amount is 136 % 32 = 8.
    // Deal with residue in prolog.
    // global_to_shared_stream.move_to_residue(params.problem_size[0], Traits::OutputTile::kD);
    global_to_shared_stream.move_to_residue(bounds[0], Traits::OutputTile::kD);

    // Fetch the fragments for A and B from global memory.
    global_to_shared_stream.copy();

    // Copy the elements to shared memory (after transformation if needed).
    global_to_shared_stream.commit();

    // Make sure the data is in shared memory.
    Traits::shared_store_fence(false);

    // Rollback to the beginning of the first tile (if residue exists).
    // global_to_shared_stream.rollback(params.problem_size[0] % Traits::OutputTile::kD);
    global_to_shared_stream.rollback(bounds[0] % Traits::OutputTile::kD);
shared_load_stream is of type Volta884GemmTraits::SharedStream, i.e. SharedStreamPair.
SharedStreamPair::copy calls MMASharedLoadStream::copy, which calls Volta884WarpMultiplicandLoadIterator::load to load from Shared Memory.
accumulators is of type Volta884MultiplyAdd::Accumulators, i.e. a Fragment.
ClearAccumulators::clear calls Fragment::clear to zero the storage.
What is outer_k? It is the main-loop counter over the K dimension: it starts at bounds[0] - Traits::OutputTile::kD and is decremented by OutputTile::kD on every mainloop iteration, so it tracks how much of K remains.
    // The stream of data from shared memory to fragments.
    typename Traits::SharedStream shared_load_stream(
        params.shared_stream,
        shared_storage.main_loop.threadblock_tile.reference());

    // Trigger the copy from shared memory for the 1st stream.
    shared_load_stream.copy(0);

    // Allocate the accumulators.
    typename MultiplyAdd::Accumulators accumulators;

    // Clear the accumulators.
    clear.clear(accumulators);

    // Initial index
    // Index outer_k = params.problem_size[0] - Traits::OutputTile::kD;
    // problem_size[0] might be bigger than bounds[0]
    Index outer_k = bounds[0] - Traits::OutputTile::kD;
If the residue was computed in the prologue, only the final iteration handles it.
Each GemmMainloop::consume_tile call processes one k = Traits::OutputTile::kD slice of the tile (a small walkthrough of the resulting schedule follows the else branch below).
    // Check if we are computing residue in prolog or not.
    if (Traits::GemmConfig::kResidueInProlog) {
      // Execute all mainloop iterations but the last one.
      CUTLASS_GEMM_LOOP
      for (; outer_k > 0; outer_k -= Traits::OutputTile::kD) {
        CUTLASS_GEMM_LOOP_HEADER
        consume_tile<false, false>(
            global_to_shared_stream, shared_load_stream, accumulators, outer_k);
      }

      consume_tile<false, true>(
          global_to_shared_stream, shared_load_stream, accumulators, outer_k);
Otherwise, residue handling happens inside the main loop: when kResidueSeparate is set, all but the last two iterations run without any K-residue predicate updates, and only the remaining iterations enable them.
    } else {
      // When kResidueSeparate = true, execute all mainloop iterations but the last two without any
      // consideration for K-residue or predicate updates. This improves the steady state of some
      // kernels.
      if (Traits::GemmConfig::kResidueSeparate) {
        CUTLASS_GEMM_LOOP
        for (; outer_k > Traits::OutputTile::kD; outer_k -= Traits::OutputTile::kD) {
          CUTLASS_GEMM_LOOP_HEADER
          consume_tile<false, false>(
              global_to_shared_stream, shared_load_stream, accumulators, outer_k);
        }
      }

      // Execute remaining tiles with K-residue predicate updates enabled.
      CUTLASS_GEMM_LOOP
      for (; outer_k > -Traits::OutputTile::kD; outer_k -= Traits::OutputTile::kD) {
        CUTLASS_GEMM_LOOP_HEADER
        consume_tile<true, false>(
            global_to_shared_stream, shared_load_stream, accumulators, outer_k);
      }
    }
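As a concrete check, the tiny host-side walkthrough below (illustrative only; it just mirrors the loop bounds of the kResidueInProlog branch above) counts the tiles that branch would consume for the 520x264x136 test: with K = 136 and OutputTile::kD = 32, outer_k starts at 104 and the kernel consumes ceil(136 / 32) = 5 K-tiles, the residue of 8 having been staged in the prologue.

#include <cstdio>

int main() {
  int const kD = 32;    // Traits::OutputTile::kD for this test case
  int const K  = 136;   // bounds[0], the GEMM K extent

  int tiles = 0;
  // Mirrors the kResidueInProlog branch: all iterations but the last...
  for (int outer_k = K - kD; outer_k > 0; outer_k -= kD) {
    ++tiles;            // consume_tile<false, false>
  }
  ++tiles;              // ...plus the final consume_tile<false, true>

  std::printf("K-tiles consumed: %d (residue of %d columns handled in the prolog)\n",
              tiles, K % kD);
  return 0;
}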
Finally, an MMAEpilogue object is created and its epilogue function is called.
    typedef typename Traits::Epilogue Epilogue;
    Epilogue epilogue(params.epilogue, shared_storage.epilogue, params.problem_size.knm());
    epilogue.epilogue(accumulators, threadblock_offset, block_swizzle.get_batch_id());
  }
};
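The epilogue itself is not covered here, but per element its job is the standard GEMM update D = alpha * accumulator + beta * C. The sketch below is a generic per-element illustration, not the MMAEpilogue source; Volta884's h884 path keeps everything in half, so the float intermediates are only for clarity.

#include <cuda_fp16.h>

// Hypothetical per-element epilogue: D = alpha * accum + beta * C.
__device__ half epilogue_element_sketch(half accum, half c, float alpha, float beta) {
  float d = alpha * __half2float(accum) + beta * __half2float(c);
  return __float2half(d);
}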
References:
- [DOC] Where does cutlass’ detailed GEMM kernel? #526
- Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
- Modeling Deep Learning Accelerator Enabled GPUs
- gpgpu-sim_distribution
- 理解Tensor Core
- Flexible Performant GEMM Kernels on GPUs
- CUDA Tensor Core編程
- PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS
- The NVIDIA Titan V Deep Learning Deep Dive: It’s All About The Tensor Cores
- 9.7.13.4.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
- Numerical Behavior of NVIDIA Tensor Cores
- CUDA Ampere Tensor Core HGEMM 矩陣乘法優化筆記 —— Up To 131 TFLOPS!
- If we have two or four memory requests by a warp, do they need coalesced access/contiguity? #328
- Do bank conflicts increase when using more shared memory?
- How does parameter computeType affect the computation?
- 2.1.10. GEMM Algorithms Numerical Behavior
- cuBLAS的使用
- RAFT在Knowhere上的一些評估測試[1]
- cudnn-frontend/tree/main/samples/samples/conv_sample.cpp
- Is a union in C++ actually a class?
- A Generalized Micro-kernel Abstraction for GPU Linear Algebra
- Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs
- Double-buffering in shared memory, details? #227
- Efficient GEMM in CUDA
- Thread synchronization with syncwarp
- Using CUDA Warp-Level Primitives
- CUDA微架構與指令集(3)-SASS指令集分類
- VOLTA Architecture and performance optimization
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Determining registers holding the data after executing LDG.E.128
- 劉冰、鄭鵬|GPU編程和優化-最佳實踐分享