CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It employs hierarchical decomposition and data-movement strategies similar to those used in cuBLAS and cuDNN.
The latest CUTLASS release is 3.3, which differs substantially from 1.3.3. Revisiting 1.3.3 is still worthwhile, however, because it is much easier to understand:
- it matches the material presented in PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS;
- it supports only one Tensor Core architecture, Volta;
- its Tensor Core path supports only one data type, half;
- it uses only one instruction, HMMA.884.F16.F16.
Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply observes that on the T4 GPU, GEMM, which used to be compute-bound, became IO-bound once Tensor Cores were introduced. Although the V100 has three times the bandwidth of the T4, it suffers from the same bandwidth shortage.
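A rough machine-balance calculation makes the point concrete. The peak numbers below are approximate public specs and only illustrative; the program simply divides FP16 Tensor Core throughput by DRAM bandwidth to show how many FLOPs must be done per byte of traffic before the Tensor Cores are saturated.

#include <cstdio>

int main() {
  // Approximate peak specs (hedged): FP16 Tensor Core throughput and DRAM bandwidth.
  const char* name[2]            = {"T4", "V100"};
  double peak_tensor_flops[2]    = {65e12, 125e12};  // ~65 and ~125 TFLOPS
  double peak_bandwidth[2]       = {300e9, 900e9};   // ~300 GB/s GDDR6, ~900 GB/s HBM2

  for (int i = 0; i < 2; ++i) {
    // FLOPs required per byte of DRAM traffic to stay compute-bound.
    double balance = peak_tensor_flops[i] / peak_bandwidth[i];
    std::printf("%s: ~%.0f FLOP per byte needed to saturate the Tensor Cores\n",
                name[i], balance);
  }
  return 0;
}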
Therefore, CUTLASS optimizes the data path as follows:
- 128-bit access granularity along the entire path: LDG.128, STS.128, LDS.128, and STG.128;
- conflict-free Shared Memory layout: no Shared Memory padding is needed for the transpose;
- software pipelining: the LDG.128, LDS.128, and HMMA.884.F16.F16 instructions execute concurrently to hide data movement (see the sketch below).
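The following is a schematic sketch of that software pipeline, not CUTLASS source: a classic double-buffered tiled GEMM in plain CUDA C++, with float FMAs standing in for HMMA.884.F16.F16 and with M, N, K assumed to be multiples of the tile size. The only point is to show how the global load for tile k+1 is issued before the math on tile k, so the LDG, LDS/STS, and MMA streams overlap.

#include <cuda_runtime.h>

constexpr int kTile = 32;  // illustrative tile edge; blockDim must be (kTile, kTile)

// Assumes M, N, K are multiples of kTile so no boundary predicates are needed.
__global__ void pipelined_sgemm_sketch(const float* A, const float* B, float* C,
                                       int M, int N, int K) {
  __shared__ float sa[2][kTile][kTile];  // double-buffered tile of A
  __shared__ float sb[2][kTile][kTile];  // double-buffered tile of B

  int row = blockIdx.y * kTile + threadIdx.y;
  int col = blockIdx.x * kTile + threadIdx.x;
  float acc = 0.f;

  // Prologue: stage the first tile into buffer 0 (the LDG -> STS leg).
  sa[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
  sb[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
  __syncthreads();

  for (int k0 = 0; k0 < K; k0 += kTile) {
    int buf = (k0 / kTile) & 1;

    // Issue the global loads for the *next* tile early; their latency overlaps
    // with the shared-memory loads and FMAs below.
    float next_a = 0.f, next_b = 0.f;
    if (k0 + kTile < K) {
      next_a = A[row * K + (k0 + kTile) + threadIdx.x];
      next_b = B[(k0 + kTile + threadIdx.y) * N + col];
    }

    // Math on the tile already resident in shared memory (the LDS -> FMA leg).
    #pragma unroll
    for (int kk = 0; kk < kTile; ++kk) {
      acc += sa[buf][threadIdx.y][kk] * sb[buf][kk][threadIdx.x];
    }

    // Commit the prefetched values into the other buffer for the next iteration.
    if (k0 + kTile < K) {
      sa[buf ^ 1][threadIdx.y][threadIdx.x] = next_a;
      sb[buf ^ 1][threadIdx.y][threadIdx.x] = next_b;
    }
    __syncthreads();
  }

  C[row * N + col] = acc;
}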
The following walks through the Volta884_h884gemm implementation, using one of the GEMM unit tests as the running example.
TEST(Volta884_h884gemm_128x64x32_nt, 520x264x136)
OutputTile is the threadblock tile, set to 32x64x128 in this test; the Shape lists the dimensions in reverse order relative to the test name, with the K extent (32) first, so this is the 128x64x32 tile named in the test. WarpGemmShape is 32x64x64, which is a fixed value.
run_gemm initializes Volta884GemmTraits::Params and a GemmTestbed, calls Gemm::launch, and then compares the result against a reference (a rough sketch of this flow follows the test case below).
TEST(Volta884_h884gemm_128x64x32_nt, 520x264x136) {

  typedef cutlass::gemm::Volta884GemmTraits<
    cutlass::MatrixLayout::kColumnMajor,
    cutlass::MatrixLayout::kRowMajor,
    cutlass::Shape<32, 64, 128>,
    cutlass::Shape<32, 64, 64>,
    half,
    half,
    half,
    2
  > GemmTraits;

  run_gemm<GemmTraits>(520, 264, 136);
}
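From memory of the CUTLASS 1.x examples, run_gemm boils down to roughly the following hedged sketch. The real test harness wraps allocation, initialization, and verification in GemmTestbed, and the exact Params::initialize argument order in 1.3.3 may differ slightly; the leading dimensions and scalar handling below are assumptions for illustration.

#include <cuda_fp16.h>
#include "cutlass/gemm/gemm.h"

template <typename GemmTraits>
void run_gemm_sketch(int m, int n, int k) {
  typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

  // Device buffers for A (m x k), B (k x n) and C/D (m x n); allocation,
  // initialization and the reference check are elided here.
  half *A = nullptr, *B = nullptr, *C = nullptr;
  int lda = m, ldb = n, ldc = m;          // leading dimensions for the "nt" layouts (assumed)
  half alpha = __float2half(1.0f);
  half beta  = __float2half(0.0f);

  // Kernel parameters: problem size, scalars and tensor references.
  typename Gemm::Params params;
  params.initialize(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc, /*D=*/C, ldc);

  // Launch, then (in the real test) compare D against the GemmTestbed reference.
  Gemm::launch(params);
}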
The hierarchy of the Volta884 implementation in CUTLASS is shown in the figure below.
gemm_kernel_nolb
The kernel function takes the dynamic Shared Memory, hands it to GemmMainloop, and then calls GemmMainloop::multiply_add to perform the computation.
/// GEMM kernel without launch bounds specified
template <typename Gemm_>
__global__ /* __launch_bounds__(Gemm_::kThreads) */
void gemm_kernel_nolb(typename Gemm_::Params params) {

  // Dynamic shared memory base pointer
  extern __shared__ int GemmSharedStorageBase[];

  // Declare pointer to dynamic shared memory.
  typename Gemm_::SharedStorage *shared_storage =
      reinterpret_cast<typename Gemm_::SharedStorage *>(GemmSharedStorageBase);

  // Construct the GEMM object.
  Gemm_ gemm(params, *shared_storage);

  // Run GEMM.
  gemm.multiply_add();
}
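For context, a kernel written this way is typically launched as in the sketch below (an assumption about what Gemm::launch does, not the 1.3.3 source): the dynamic shared-memory size is sizeof(SharedStorage), and an explicit opt-in via cudaFuncSetAttribute is required whenever it exceeds the default 48 KB limit.

#include <cuda_runtime.h>

// Hypothetical host-side launcher mirroring the kernel signature above.
template <typename Gemm_>
cudaError_t launch_gemm_sketch(typename Gemm_::Params const& params,
                               dim3 grid, cudaStream_t stream = 0) {
  int const smem_bytes = int(sizeof(typename Gemm_::SharedStorage));

  // Volta allows up to 96 KB of dynamic shared memory per block, but anything
  // beyond 48 KB needs an explicit opt-in.
  if (smem_bytes >= (48 << 10)) {
    cudaError_t err = cudaFuncSetAttribute(gemm_kernel_nolb<Gemm_>,
                                           cudaFuncAttributeMaxDynamicSharedMemorySize,
                                           smem_bytes);
    if (err != cudaSuccess) {
      return err;
    }
  }

  dim3 block(Gemm_::kThreads, 1, 1);
  gemm_kernel_nolb<Gemm_><<<grid, block, smem_bytes, stream>>>(params);
  return cudaGetLastError();
}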
GemmMainloop
GemmMainloop implements the software pipeline, as shown in the figure below:
Both Shared Memory and the registers need two buffers; scheduling on the SM keeps the three pipelines running concurrently. Loads from Global Memory to Shared Memory require synchronization, whereas moving data from Shared Memory into registers does not. Because architectures before Ampere cannot copy directly from Global Memory to Shared Memory, the whole staging process is fairly involved. As shown in the figure below, the code calls the Copy::transform function in several places to produce a transformed_fragment. The purpose is presumably type conversion, but since Volta supports only half, it has no practical effect here.
template <typename Traits_>
struct GemmMainloop {
  //
  // Type definitions
  //

  /// The traits.
  typedef Traits_ Traits;
  /// The GEMM mainloop
  typedef typename Traits::KernelClass KernelClass;
  /// The shared storage.
  typedef typename Traits::SharedStorage SharedStorage;
  /// The scalar for A.
  typedef typename Traits::ScalarA ScalarA;
  /// The scalar for B.
  typedef typename Traits::ScalarB ScalarB;
  /// The scalar in the epilogue.
  typedef typename Traits::Epilogue::Scalar ScalarEpilogue;
  /// The scalar for C.
  typedef typename Traits::Epilogue::ScalarC ScalarC;
  /// The scalar for D.
  typedef typename Traits::Epilogue::ScalarD ScalarD;
  /// The index.
  typedef typename Traits::Index Index;
  /// Define the mainloop iteration size
  typedef typename Traits::MultiplyAdd MultiplyAdd;
  /// The number of threads.
  static int const kThreads = Traits::GemmConfig::kThreads;
AccumulatorsPerWarp is GemmConfig::AccumulatorsPerWarp, i.e. Volta884MultiplyAdd::WarpGemmShape, which is 32x64x64. Volta884MultiplyAdd::InstructionShape is 4x32x32. kWarpGemmSteps is therefore 32 / 4 = 8.
  // Number of warp-level multiply-accumulate steps executed by each warp.
  static Index const kWarpGemmSteps =
      Traits::GemmConfig::AccumulatorsPerWarp::kD / MultiplyAdd::InstructionShape::kD;

  /*
  // Make sure we have at least 2 unrolling steps or our pipeling is not going to work.
  static_assert(kWarpGemmSteps >= 2, "The pipelining assumes at least two steps");
  */

  /// Use the params object defined in traits
  typedef typename Traits::Params Params;

  //
  // Data members
  //

  /// The params.
  Params const& params;

  /// SharedStorage object
  SharedStorage& shared_storage;
  //
  // Methods
  //

  /// Ctor.
  CUTLASS_DEVICE GemmMainloop(Params const& params_, SharedStorage& shared_storage_)
      : params(params_), shared_storage(shared_storage_) {}
GemmMainloop::fetch_global
Volta884GemmTraits::GlobalLoadStream is the GlobalLoadStreamPair type.
GlobalLoadStreamPair::residue calls MMAGlobalLoadStream::residue twice to compute the predicate masks needed for the final load of the threadblock tile.
GlobalLoadStreamPair::copy calls MMAGlobalLoadStream::copy twice to copy matrix elements from Global Memory into registers; the latter calls TileLoadIterator::load_post_increment.
  /// Fetches global stream pair
  template <bool Residue>
  CUTLASS_DEVICE void fetch_global(typename Traits::GlobalLoadStream& global_to_shared_stream,
                                   Index outer_k) {
    // If residue portion and not calculating residue in prolog, update residue predicates now.
    if (Residue) {
      global_to_shared_stream.residue(outer_k);
    }
    global_to_shared_stream.copy();
  }
GemmMainloop::consume_tile
If kWarpGemmSteps is less than or equal to 4, kGlobalStreamFirst is set and the data for the next iteration is loaded from Global Memory first.
  /// Computes a warp-level GEMM on data held in shared memory
  template <bool Residue, bool LastIteration>
  CUTLASS_DEVICE void consume_tile(typename Traits::GlobalLoadStream& global_to_shared_stream,
                                   typename Traits::SharedStream& shared_load_stream,
                                   typename MultiplyAdd::Accumulators& accumulators,
                                   Index outer_k) {
    // Whether to load global stream before loading shared stream
    const bool kGlobalStreamFirst = (kWarpGemmSteps <= 4);

    // Load data for the next iteration of the main loop (unless it's the last iteration).
    if (kGlobalStreamFirst && !LastIteration) {
      fetch_global<Residue>(global_to_shared_stream, outer_k);
    }
The loop first loads the next step's input from Shared Memory; the register fragments are double-buffered.
MMASharedLoadStream::copy calls Volta884WarpMultiplicandLoadIterator::load to bring the data into registers.
One open question: if GemmMainloop::fetch_global was not called in the previous step, is the copy from Shared Memory still safe?
    CUTLASS_PRAGMA_UNROLL
    for (int step = 0; step < kWarpGemmSteps; ++step) {
      // Trigger the copy from shared memory for the next A/B values.
      shared_load_stream.copy((step + 1) % kWarpGemmSteps);
If kGlobalStreamFirst is not set, GemmMainloop::fetch_global is called on the first step of the loop to load the input.
      // Load data for the next iteration of the main loop (unless it's the last iteration).
      if (!kGlobalStreamFirst && (step == 0) && !LastIteration) {
        fetch_global<Residue>(global_to_shared_stream, outer_k);
      }
On the second-to-last step, the data must be guaranteed to have arrived in Shared Memory.
Volta884GemmTraits::shared_load_fence decides whether to synchronize the threads based on the StageCount passed in from outside.
GlobalLoadStreamPair::commit calls GlobalLoadStream::commit for each of the two matrices to copy the data into Shared Memory.
Volta884GemmTraits::shared_store_fence synchronizes the threads.
MMASharedLoadStream::inc_stage advances stage_index (a sketch of what the two fences boil down to follows the code below).
      if (step == kWarpGemmSteps - 2) {
        // Make sure the data from shared memory has been entirely consumed.
        Traits::shared_load_fence(true);

        global_to_shared_stream.commit();

        // Make sure the data is in shared memory.
        Traits::shared_store_fence(true);

        // Move to the next stage for the load (if it makes sense).
        shared_load_stream.inc_stage();
      }
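What the two fences reduce to is, presumably, just conditional __syncthreads calls; the sketch below is an assumption based on the description above (the real Volta884GemmTraits definitions gate this on the traits' stage count), not the verbatim 1.3.3 code.

// Hypothetical stand-ins for Volta884GemmTraits::shared_load_fence / shared_store_fence.
template <int StageCount>
struct SharedFenceSketch {
  // Before overwriting a shared-memory stage: with only one stage the warps may
  // still be reading it, so the whole threadblock has to synchronize first.
  __device__ static void shared_load_fence(bool /*in_loop*/) {
    if (StageCount < 2) {
      __syncthreads();
    }
  }

  // After committing a new tile: it must be visible to every warp before use.
  __device__ static void shared_store_fence(bool /*in_loop*/) {
    __syncthreads();
  }
};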
MMASharedLoadStream::commit calls Copy to perform the copy; Volta884WarpMultiplicandLoadIterator::Fragment is simply a Fragment.
Volta884MultiplyAdd::multiply_add then performs the warp-tile computation.
      // Make sure the values are available for the current iteration to do the multiply-add.
      shared_load_stream.commit(step);

      // Do the math on the fragments of the current iteration.
      MultiplyAdd multiply_add;
      multiply_add.multiply_add(shared_load_stream.fragment_a(step),
                                shared_load_stream.fragment_b(step),
                                accumulators,
                                accumulators);
    }
  }
GemmMainloop::multiply_add
make_Coord_from_shape creates a Coord object from a shape.
IdentityBlockSwizzle::get_threadblock_offset returns the offset of the current threadblock within the 2D output.
Volta884GemmTraits::ClearAccumulators is ClearAccumulators.
IdentityBlockSwizzle::get_threadblock_bounds returns the 3D bounds of the threadblock.
  /// Do the GEMM.
  CUTLASS_DEVICE void multiply_add() {
    // Swizzle the IDs of the block (to enable better cache behavior).
    typename Traits::BlockSwizzle block_swizzle;
    Coord<3> threadblock_offset =
        block_swizzle.get_threadblock_offset(make_Coord_from_shape<typename Traits::OutputTile>());

    // We may want to use shared memory to clear the registers.
    typedef typename Traits::ClearAccumulators ClearAccumulators;

    // Get the bounds for each thread, it maybe different than problem_size
    Coord<3> bounds = block_swizzle.get_threadblock_bounds(params.problem_size,
                                                           params.partitionK_range);
params.global_to_shared_stream is a GlobalLoadStreamPair::Params.
shared_storage.main_loop.global_to_shared_stream is a GlobalLoadStreamPair::SharedStorage.
shared_storage.main_loop.threadblock_tile is a GlobalLoadStreamPair::ThreadblockTileStorage, i.e. a ZipTileAllocation; ZipTileAllocation::reference returns a ZipTensorRef object pointing at the data.
global_to_shared_stream is a Volta884GemmTraits::GlobalLoadStream, i.e. a GlobalLoadStreamPair.
GlobalLoadStreamPair::add_batch_offset calls GlobalLoadStream::add_batch_offset on each stream to set the iterators' batch offsets.
    // The streams to read A/B from global memory to shared memory.
    typename Traits::GlobalLoadStream global_to_shared_stream(
        params.global_to_shared_stream,
        shared_storage.main_loop.global_to_shared_stream,
        shared_storage.main_loop.threadblock_tile.reference(),
        bounds,
        threadblock_offset);

    // update A and B pointer offset based on batch_id and batch_stride_offset
    global_to_shared_stream.add_batch_offset(block_swizzle.get_batch_id());

    // Create the accumulator clear.
    ClearAccumulators clear;
GlobalLoadStreamPair::move_to_residue calls MMAGlobalLoadStream::move_to_residue to move the pointers when the residue is handled in the prologue; otherwise it directly calls GlobalLoadStreamPair::residue.
GlobalLoadStreamPair::copy calls MMAGlobalLoadStream::copy, which in turn calls TileLoadIterator::load_post_increment to load fragments of the A and B matrices into Fragment registers.
GlobalLoadStreamPair::commit calls MMAGlobalLoadStream::commit, which calls Copy::transform to perform the copy and then
Volta884ThreadblockMultiplicandStoreIterator::store_post_increment to store the data into Shared Memory.
Volta884GemmTraits::shared_store_fence synchronizes the threads within the threadblock.
GlobalLoadStreamPair::rollback calls MMAGlobalLoadStream::rollback, which calls TileLoadIterator::initialize_predicates to initialize the predicate vector and then moves the offset back; for this test (K = 136, OutputTile::kD = 32, assuming no partition-K split) the rollback amount is 136 % 32 = 8.
    // Deal with residue in prolog.
    // global_to_shared_stream.move_to_residue(params.problem_size[0], Traits::OutputTile::kD);
    global_to_shared_stream.move_to_residue(bounds[0], Traits::OutputTile::kD);

    // Fetch the fragments for A and B from global memory.
    global_to_shared_stream.copy();

    // Copy the elements to shared memory (after transformation if needed).
    global_to_shared_stream.commit();

    // Make sure the data is in shared memory.
    Traits::shared_store_fence(false);

    // Rollback to the beginning of the first tile (if residue exists).
    // global_to_shared_stream.rollback(params.problem_size[0] % Traits::OutputTile::kD);
    global_to_shared_stream.rollback(bounds[0] % Traits::OutputTile::kD);
shared_load_stream is of type Volta884GemmTraits::SharedStream, i.e. SharedStreamPair.
SharedStreamPair::copy calls MMASharedLoadStream::copy, which calls Volta884WarpMultiplicandLoadIterator::load to load from Shared Memory.
accumulators is of type Volta884MultiplyAdd::Accumulators, i.e. a Fragment.
ClearAccumulators::clear calls Fragment::clear to zero the storage.
What is outer_k? It is the main-loop counter over the K dimension: it starts at bounds[0] - Traits::OutputTile::kD and is decremented by OutputTile::kD on every mainloop iteration, so it tracks how much of K remains.
    // The stream of data from shared memory to fragments.
    typename Traits::SharedStream shared_load_stream(
        params.shared_stream,
        shared_storage.main_loop.threadblock_tile.reference());

    // Trigger the copy from shared memory for the 1st stream.
    shared_load_stream.copy(0);

    // Allocate the accumulators.
    typename MultiplyAdd::Accumulators accumulators;

    // Clear the accumulators.
    clear.clear(accumulators);

    // Initial index
    // Index outer_k = params.problem_size[0] - Traits::OutputTile::kD;
    // problem_size[0] might be bigger than bounds[0]
    Index outer_k = bounds[0] - Traits::OutputTile::kD;
If the residue was computed in the prologue, only the final iteration handles it.
Each GemmMainloop::consume_tile call processes one k = Traits::OutputTile::kD slice of the tile (a small walkthrough of the resulting schedule follows the else branch below).
    // Check if we are computing residue in prolog or not.
    if (Traits::GemmConfig::kResidueInProlog) {
      // Execute all mainloop iterations but the last one.
      CUTLASS_GEMM_LOOP
      for (; outer_k > 0; outer_k -= Traits::OutputTile::kD) {
        CUTLASS_GEMM_LOOP_HEADER
        consume_tile<false, false>(
            global_to_shared_stream, shared_load_stream, accumulators, outer_k);
      }

      consume_tile<false, true>(
          global_to_shared_stream, shared_load_stream, accumulators, outer_k);
Otherwise, residue handling happens inside the main loop: when kResidueSeparate is set, all but the last two iterations run without any K-residue predicate updates, and only the remaining iterations enable them.
    } else {
      // When kResidueSeparate = true, execute all mainloop iterations but the last two without any
      // consideration for K-residue or predicate updates. This improves the steady state of some
      // kernels.
      if (Traits::GemmConfig::kResidueSeparate) {
        CUTLASS_GEMM_LOOP
        for (; outer_k > Traits::OutputTile::kD; outer_k -= Traits::OutputTile::kD) {
          CUTLASS_GEMM_LOOP_HEADER
          consume_tile<false, false>(
              global_to_shared_stream, shared_load_stream, accumulators, outer_k);
        }
      }

      // Execute remaining tiles with K-residue predicate updates enabled.
      CUTLASS_GEMM_LOOP
      for (; outer_k > -Traits::OutputTile::kD; outer_k -= Traits::OutputTile::kD) {
        CUTLASS_GEMM_LOOP_HEADER
        consume_tile<true, false>(
            global_to_shared_stream, shared_load_stream, accumulators, outer_k);
      }
    }
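As a concrete check, the tiny host-side walkthrough below (illustrative only; it just mirrors the loop bounds of the kResidueInProlog branch above) counts the tiles that branch would consume for the 520x264x136 test: with K = 136 and OutputTile::kD = 32, outer_k starts at 104 and the kernel consumes ceil(136 / 32) = 5 K-tiles, the residue of 8 having been staged in the prologue.

#include <cstdio>

int main() {
  int const kD = 32;    // Traits::OutputTile::kD for this test case
  int const K  = 136;   // bounds[0], the GEMM K extent

  int tiles = 0;
  // Mirrors the kResidueInProlog branch: all iterations but the last...
  for (int outer_k = K - kD; outer_k > 0; outer_k -= kD) {
    ++tiles;            // consume_tile<false, false>
  }
  ++tiles;              // ...plus the final consume_tile<false, true>

  std::printf("K-tiles consumed: %d (residue of %d columns handled in the prolog)\n",
              tiles, K % kD);
  return 0;
}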
Finally, an MMAEpilogue object is created and its epilogue function is called.
    typedef typename Traits::Epilogue Epilogue;
    Epilogue epilogue(params.epilogue, shared_storage.epilogue, params.problem_size.knm());
    epilogue.epilogue(accumulators, threadblock_offset, block_swizzle.get_batch_id());
  }
};
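The epilogue itself is not covered here, but per element its job is the standard GEMM update D = alpha * accumulator + beta * C. The sketch below is a generic per-element illustration, not the MMAEpilogue source; Volta884's h884 path keeps everything in half, so the float intermediates are only for clarity.

#include <cuda_fp16.h>

// Hypothetical per-element epilogue: D = alpha * accum + beta * C.
__device__ half epilogue_element_sketch(half accum, half c, float alpha, float beta) {
  float d = alpha * __half2float(accum) + beta * __half2float(c);
  return __float2half(d);
}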
References:
- [DOC] Where does cutlass’ detailed GEMM kernel? #526
- Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
- Modeling Deep Learning Accelerator Enabled GPUs
- gpgpu-sim_distribution
- 理解Tensor Core
- Flexible Performant GEMM Kernels on GPUs
- CUDA Tensor Core編程
- PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS
- The NVIDIA Titan V Deep Learning Deep Dive: It’s All About The Tensor Cores
- 9.7.13.4.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
- Numerical Behavior of NVIDIA Tensor Cores
- CUDA Ampere Tensor Core HGEMM 矩陣乘法優化筆記 —— Up To 131 TFLOPS!
- If we have two or four memory requests by a warp, do they need coalesced access/contiguity? #328
- Do bank conflicts increase when using more shared memory?
- How does parameter computeType affect the computation?
- 2.1.10. GEMM Algorithms Numerical Behavior
- cuBLAS的使用
- RAFT在Knowhere上的一些評估測試[1]
- cudnn-frontend/tree/main/samples/samples/conv_sample.cpp
- Is a union in C++ actually a class?
- A Generalized Micro-kernel Abstraction for GPU Linear Algebra
- Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs
- Double-buffering in shared memory, details? #227
- Efficient GEMM in CUDA
- Thread synchronization with syncwarp
- Using CUDA Warp-Level Primitives
- CUDA微架構與指令集(3)-SASS指令集分類
- VOLTA Architecture and performance optimization
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Determining registers holding the data after executing LDG.E.128
- 劉冰、鄭鵬|GPU編程和優化-最佳實踐分享