(1) I challenged myself to hand-write the code to verify the theory, which brought insights and reflections that AI tools cannot provide; for the questions I could not answer, please advise in the comments;
(2) This series involves many details; considering readers' absorption and article length, I share only the key points. If anything is unclear or wrongly explained, please point it out in the comments;
(3) The write-up was originally composed in English and is not translated into Chinese here; I hope readers understand;
(4) This series is based on the textbook Dive into Deep Learning (《動手學深度學習》) by Mu Li:
Dive into Deep Learning — 2.0.0 documentation
(5) Since the amount of code is large, it is uploaded to my personal space as a free resource, so readers can run and use it conveniently.
Note: AlexNet is provided in both PyTorch and MXNet implementations; LeNet is provided only in the MXNet framework.
The original authors' trained model parameters can also be downloaded directly through the deep-learning frameworks, but this experiment aims to explore the theoretical foundations and implementation ideas of CNNs, so different versions of "LeNet" and "AlexNet" are trained from scratch.
Different code implementations and analysis approaches are also proposed. For models with a high training cost, Google Colaboratory's free compute platform is recommended; it is essentially an Ubuntu-based server configured with deep-learning frameworks such as PyTorch and TensorFlow.
This article mainly analyzes:
[1] The working principles of the convolutional, pooling, batch-normalization, activation, and dropout layers in a CNN.
The next article will analyze:
[2] The composition of the time cost when training on a single CPU core, its experimental verification, and the speedup offered by function interfaces;
[3] How to tune hyperparameters such as the learning rate, optimization method, batch size, and activation function;
[4] The performance of the convolutional neural network (LeNet, 1998) and the deep convolutional neural network (AlexNet, 2012) on the MNIST, Fashion_MNIST, and CIFAR100 datasets, and a possibly feasible method for adaptively adjusting the parameter size;
[5] Visualization of CNN activation-layer features, with an intuitive comparison against the filtering effects of hand-designed kernels, to understand the information-extraction process of a CNN;
[6] The role of the confusion matrix, and how to plot a customized confusion matrix.
Building the Deep Learning Edifice from Scratch, Part 3: Convolutional Neural Network Basics (5-9) - CSDN blog: https://blog.csdn.net/2302_80464577/article/details/149260898
A Quick Look
LeNet (based on MXNet; textbook: 2019 + GPU, max-pooling; mine: max-pooling + 2 / 5 prefetching processes)
2 prefetching processes
5 prefetching processes
AlexNet (based on PyTorch; textbook: original; mine: parameter size nearly 1/256 of the original design)
2 prefetching processes (batch size = 64)
5 prefetching processes (batch size = 64/32, initial learning rate = 0.01/0.03)
Figure 1 Textbook's results vs mine
Contents
Environment Setting
Experiment Goals
1. Edge Detection
1.1 Basic Principle
1.2 Function Design
1.3 Carrying-out Result
2. Shape of layers and kernels in a CNN
2.1 Basic Theories
2.2 Code implementation (numpy, mxnet.gluon.nn, mxnet.nd)
2.3 Result
3. 1x1 Convolution
3.1 Basic Theory
3.2 Code implementation (3 lines)
3.3 Result
4-5 CNN Architecture Implementation and Evaluation
About Data loaders
About num_workers and prefetching processes
4. LeNet Implementation (MxNet based)
4.1 Basic Theories
4.2 Code Implementation
4.3 Model Evaluation on Fashion-MNIST dataset
4.3.1 Pooling: Maximum-pooling VS Average-pooling
4.3.2 Optimization: sgd vs sgd+momentum (nag)
4.3.3 Activation Function: ReLU vs sigmoid
4.3.4 Normalization Layer: Batch Normalization VS None
4.3.5 Batch size: 64 vs 128
4.3.6 Textbook Result (Batch Normalization) && Running Snapshot
4.4 LeNet Evaluation on MNIST dataset
4.5 Evaluating LeNet on CIFAR100
4.5.1 Coarse Classification (20 classes)
4.5.2 Fine Classification (100 classes)
4.5.3 Running Snapshot
5. AlexNet Architecture
5.1 Code Implementation
5.2 Fashion_MNIST Dataset (Mxnet vs Pytorch)
5.3 MNIST Dataset (Pytorch only)
5.4 CIFAR100 (100 classes, fine labels) - Pytorch Only
5.4.1 Learning rate setting
6. CNN activation layer characteristics visualization
6.1 MNIST Dataset
6.2 Fashion_MNIST Dataset
7. Confusion Matrix
7.1 MNIST
7.2 Fashion_MNIST
References
Environment Setting
All four experiments are carried out in a virtual environment based on the Python 3.7.0 interpreter. The main packages include the deep-learning package mxnet 1.7.0.post2 (CPU version), the visualization package matplotlib.pyplot, the image-processing package opencv-python, and the array-manipulation package numpy.
Experiment Goals
- Design appropriate kernels with fixed parameters and detect edges with horizontal, vertical, and diagonal orientations separately;
- Derive the shape-transformation formula for the forward propagation of a CNN (Convolutional Neural Network) and verify the result both by fundamental coding and by calling scripts;
- Understand the effect and principle of 1x1 kernels, then explore different implementations of 1x1 convolution in the 2-dimensional plane, such as cross-correlation calculation and matrix multiplication;
- Construct LeNet[2] by hand using mxnet.gluon.nn and explore how different hyperparameter settings impact the training result and model performance;
- Construct AlexNet[3] by hand using torch.nn and explore how different hyperparameter settings impact the training result and model performance.
1. Edge Detection
1.1 Basic Principle
According to the corresponding theories in DIP (Digital Image Processing), first-order difference operators, such as the Prewitt and Sobel kernels in their horizontal, vertical, and two diagonal design versions, can be used to detect edges in gray-scale images.
These kernels filter out the transitions between different objects, or between parts of an object, because the intensity levels of the pixels distributed along both sides of an edge change rapidly.
In addition, a composite-orientation algorithm is added to combine the information from all directions; see the 'combimg' implementation for details.
1.2 Function Design
This section employs two tool functions to accomplish the goal: get_data(input_dir) for image loading (similar to building a dataset), and edge_detect(input_dir) for cross-correlation calculation under different settings of kernel shape and layer shape.
Figure 2 Code implementation
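Since Figure 2 shows the implementation only as a screenshot, here is a minimal sketch of what the two functions might look like, assuming gray-scale loading with opencv-python and four Prewitt kernels; the exact kernel values and the pixel-wise-maximum rule used for 'combimg' are my assumptions, not necessarily the original code.

```python
import os
import cv2
import numpy as np

# Four Prewitt kernels: horizontal, vertical, and two diagonal orientations
# (assumed values; the original screenshot may differ).
PREWITT = {
    "horizontal": np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=np.float32),
    "vertical":   np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32),
    "diag_45":    np.array([[0, 1, 1], [-1, 0, 1], [-1, -1, 0]], dtype=np.float32),
    "diag_135":   np.array([[1, 1, 0], [1, 0, -1], [0, -1, -1]], dtype=np.float32),
}

def get_data(input_dir):
    """Load every image in input_dir as a gray-scale float array (a mini-dataset)."""
    imgs = {}
    for name in os.listdir(input_dir):
        img = cv2.imread(os.path.join(input_dir, name), cv2.IMREAD_GRAYSCALE)
        if img is not None:
            imgs[name] = img.astype(np.float32)
    return imgs

def edge_detect(input_dir):
    """Cross-correlate each image with the four kernels and combine the maps."""
    results = {}
    for name, img in get_data(input_dir).items():
        maps = {d: cv2.filter2D(img, -1, k) for d, k in PREWITT.items()}
        # 'combimg': combine all orientations via the pixel-wise maximum.
        maps["combimg"] = np.maximum.reduce(list(maps.values()))
        results[name] = maps
    return results
```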
1.3 Carrying-out Result
Six scenery photos with obvious edge information, posted by professional photographers on websites, are chosen as a mini-dataset.
Figure 3 Mini-dataset in Mission 1
Only the 'combimg' outputs are saved. 'canyon.jpg' illustrates the directional attribute of the Prewitt kernels vividly.
Figure 4 canyon.jpg
Other examples follow; the orientation of their textures more or less verifies the DIP theory.
Figure 5 Galaxy; notice the small dot in the picture with interesting behavior (the combined dot has a black circle within it, while the others only contain shapes like rectangular lines)
Figure 6 Bungalow lying in the embrace of lake and mountains
Figure 7 Grassland and night sky on an estate
Figure 8 Clouds
2. Shape of layers and kernels in a CNN
2.1 Basic Theories
Figure 9 Kernel and Layer in a CNN
Unlike a multi-layer perceptron (MLP), which consists of hidden neurons (intermediate outputs) and the lines (weights) fully connecting them, a CNN is mainly characterized by kernels (analogous to the weights of an MLP) and feature maps (analogous to the nodes of an MLP), with activation functions, normalization layers, and other designs together constructing the architecture. Kernels can also be understood as components of certain CNN layers.
Kernels exist mainly to reduce the otherwise overwhelming parameter size and to reuse parameters scientifically, in accordance with the spatial locality and adjacency principles. Input images are transformed into different feature maps by the convolution or cross-correlation operations of the kernels. Notice that kernel parameters can be either adjustable (convolutional kernels) or non-adjustable (pooling kernels).
These feature maps can encode implicit information such as the edges of objects. Part 1 of this experiment demonstrates the effect of hand-designed edge-detection kernels. For layers near the top of deeper networks, the feature maps may capture rather global information (sometimes nothing can be learned, possibly due to small input images combined with a large depth, which is one motivation for ResNet); AlexNet and LeNet serve as examples.
Figure 10 Characteristics Visualization & Understanding[1]
Figure 11 Primitive CNN architectures proposed (1998, 2012)
A common design problem is to estimate the parameter size (storage) and the training time (measured in CPU/GPU hours) of a 2D-CNN architecture. The shape of a feature map is fixed in the 'NCHW' format (or 'NHWC'), while the shape of a kernel is denoted 'CoCiKhKw' (or 'KhKwCiCo'). See Figure 9 for a graphic explanation.
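For the parameter-size half of that estimate, the standard count for a single convolutional layer (weights plus one bias per output channel) is:

$$\#\text{params} = C_o \times C_i \times K_h \times K_w + C_o$$

For example, with the Section 2.3 settings (Co=4, Ci=3, Kh=Kw=3), the layer stores 4x3x3x3 + 4 = 112 scalars.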
Figure 12 Cross-correlation calculation at a 2D-convolutional layer
According to standard design, and the convention within the textbook, NCHW and CoCiKhKw should satisfy C == Ci. When Co == 1, Ci different kernels perform convolution (equivalent to cross-correlation operations in implementation) separately, one kernel per input feature map of size Nx1xHxW.
The result is obtained by pixel-wise summation over the Ci intermediate Nx1xH2xW2 maps, yielding a composite Nx1xH2xW2 feature map with richer information. Repeating this process Co times produces the final output of shape NxCoxH2xW2. Kernel size, padding, and stride are the three basic settings of a convolution operation, and they determine the mappings H->H2 and W->W2.
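Stated explicitly, with per-side paddings ph, pw and strides sh, sw, and consistent with Figure 13's note that the quotients are floored to integers:

$$H_2=\left\lfloor\frac{H+2p_h-K_h}{s_h}\right\rfloor+1,\qquad W_2=\left\lfloor\frac{W+2p_w-K_w}{s_w}\right\rfloor+1$$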
A pooling layer has kernels with unlearnable parameters and is generally divided into max-pooling and average-pooling.
2.2 Code implementation (numpy, mxnet.gluon.nn, mxnet.nd)
Two approaches are used for verification: direct hand-coding and package calls.
The input images are random values generated by numpy that simulate noise, serving to verify the shapes of the feature maps at the current layer. The kernels vary across the Ci channels and are identical across the Co channels. Five nested loops accomplish the computation.
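The hand-coded version appears only as a figure in the original, so the following is a minimal numpy sketch of such a 5-loop implementation (the name convsize_verify follows Section 3.2; the spatial size is shrunk from 360x480 so the loops finish quickly):

```python
import numpy as np

def convsize_verify(x, k, ph=1, pw=1, sh=1, sw=1):
    """Hand-coded cross-correlation with 5 nested loops.
    x: input of shape (N, Ci, H, W); k: kernels of shape (Co, Ci, Kh, Kw)."""
    N, Ci, H, W = x.shape
    Co, _, Kh, Kw = k.shape
    xp = np.pad(x, ((0, 0), (0, 0), (ph, ph), (pw, pw)))  # per-side zero padding
    H2 = (H + 2 * ph - Kh) // sh + 1
    W2 = (W + 2 * pw - Kw) // sw + 1
    y = np.zeros((N, Co, H2, W2), dtype=x.dtype)
    for n in range(N):                       # loop 1: samples
        for co in range(Co):                 # loop 2: output channels
            for ci in range(Ci):             # loop 3: input channels (summed)
                for i in range(H2):          # loop 4: output rows
                    for j in range(W2):      # loop 5: output columns
                        patch = xp[n, ci, i*sh:i*sh+Kh, j*sw:j*sw+Kw]
                        y[n, co, i, j] += (patch * k[co, ci]).sum()
    return y

x = np.random.rand(2, 3, 36, 48)    # N=2, Ci=3 (H, W shrunk from 360x480)
k = np.random.rand(4, 3, 3, 3)      # Co=4, Kh=Kw=3
print(convsize_verify(x, k).shape)  # expected: (2, 4, 36, 48)
```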
2.3 Result
These hand-coded kernels can actually be interpreted as smoothing filters with small variance because of the k_base setting in the code block, whereas the parameters in nn.Conv2D are initialized randomly and carry no meaning at the start of training. Incidentally, the network layer is not initialized in this section because it is not necessary to do so.
nn.Conv2D can detect in_channels automatically, and in that case the layer goes through deferred initialization. Re-initialization, or assigning in_channels by hand, avoids deferred initialization.
In the simulation, N=2, Ci=3, Co=4, H=360, W=480, Kh=Kw=3, ph=pw=1 (per side), sh=sw=1. The result shows that the shape formula is correct.
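As a quick cross-check of both the deferred initialization and the shape formula under these settings, a sketch along the following lines can be used (assuming the mxnet 1.x gluon API):

```python
from mxnet import nd
from mxnet.gluon import nn

# in_channels is omitted, so the layer is deferred-initialized: its weight
# shape is only decided on the first forward pass.
conv = nn.Conv2D(channels=4, kernel_size=3, padding=1)  # Co=4, Kh=Kw=3, ph=pw=1
conv.initialize()
x = nd.random.uniform(shape=(2, 3, 360, 480))  # NCHW: N=2, Ci=3, H=360, W=480
y = conv(x)
print(y.shape)            # (2, 4, 360, 480), matching the formula
print(conv.weight.shape)  # (4, 3, 3, 3): in_channels inferred as 3
```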
Figure 13 H2 and W2 should be floored to integers[1]
3. 1x1 Convolution
3.1 Basic Theory
1x1 convolution is specifically used to compress the channel count C of feature maps so as to reduce the number of parameters needed. In this case, ph=pw=0, sh=sw=1, and Co<Ci.
3.2 Code implementation (3 lines)
The NHWC format is used in the matrix-multiplication implementation of 1x1 convolution.
A much slower implementation of 1x1 convolution is to simply adjust the parameters and call the hand-coded convsize_verify().
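The original 3-line version appears only as a figure; under the NHWC assumption its core plausibly reduces to reshape -> matrix multiplication -> reshape, roughly as follows (the function name conv1x1 and the (Ci, Co) weight layout are my choices for illustration):

```python
import numpy as np

def conv1x1(x, k):
    """1x1 convolution as matrix multiplication in NHWC layout.
    x: (N, H, W, Ci); k: (Ci, Co) with Co < Ci for channel compression."""
    N, H, W, Ci = x.shape
    y = x.reshape(-1, Ci) @ k              # (N*H*W, Ci) @ (Ci, Co)
    return y.reshape(N, H, W, k.shape[1])  # back to feature-map layout

x = np.random.rand(2, 360, 480, 3)
k = np.random.rand(3, 2)                   # compress Ci=3 down to Co=2
print(conv1x1(x, k).shape)                 # (2, 360, 480, 2)
```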
3.3 Result
The result indicates that mxnet.gluon.nn implements convolution in the form of matrix multiplication. A similar method can be generalized to convolution with arbitrary kernel sizes:
Given a layer of N feature maps, first divide the input feature maps into M = H2xW2 flattened pixel vectors (each of length KhxKw);
then take dot products with the flattened kernels along the two dimensions;
finally, reshape the output feature maps to obtain the result.
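As an illustration of these three steps, here is a hedged numpy sketch of the im2col-style generalization (note that with Ci input channels the flattened patch length becomes CixKhxKw rather than KhxKw per map):

```python
import numpy as np

def im2col_conv(x, k, ph=0, pw=0, sh=1, sw=1):
    """General convolution via matrix multiplication (im2col), NCHW layout.
    x: (N, Ci, H, W); k: (Co, Ci, Kh, Kw)."""
    N, Ci, H, W = x.shape
    Co, _, Kh, Kw = k.shape
    xp = np.pad(x, ((0, 0), (0, 0), (ph, ph), (pw, pw)))
    H2 = (H + 2 * ph - Kh) // sh + 1
    W2 = (W + 2 * pw - Kw) // sw + 1
    # Step 1: gather H2*W2 flattened patches (length Ci*Kh*Kw) per image.
    cols = np.empty((N, H2 * W2, Ci * Kh * Kw), dtype=x.dtype)
    for i in range(H2):
        for j in range(W2):
            patch = xp[:, :, i*sh:i*sh+Kh, j*sw:j*sw+Kw]
            cols[:, i * W2 + j, :] = patch.reshape(N, -1)
    out = cols @ k.reshape(Co, -1).T          # Step 2: dot with flattened kernels
    return out.transpose(0, 2, 1).reshape(N, Co, H2, W2)  # Step 3: reshape back
```

Up to floating-point rounding, this should match the loop-based convsize_verify() above.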
A more detailed explanation is as follows.