[TOC]

Chapter 4 Classic Networks

4.1 LeNet-5

4.1.1 Introduction to the model

LeNet-5 is a convolutional neural network (CNN)$^{[1]}$ proposed by LeCun for recognizing handwritten digits and machine-printed characters. The name is derived from the author, LeCun; the 5 is simply a version code, and the earlier LeNet-1 and LeNet-4 were little known. LeNet-5 demonstrated that the correlations between pixel features in an image can be extracted by parameter-sharing convolution operations, and the combination of convolution, downsampling (pooling), and nonlinear mapping that it popularized remains the basis of most deep image-recognition networks today.

4.1.2 Model structure

Figure 4.1 LeNet-5 network structure

As shown in Figure 4.1, LeNet-5 consists of 7 layers (the input layer is not counted as part of the network): 2 convolutional layers, 2 downsampling layers, and 3 fully connected layers. The parameter configuration of the network is shown in Table 4.1. For the downsampling layers and the fully connected layers, the kernel size column gives the sampling range and the size of the connection matrix, respectively (e.g., a convolution kernel size of "$5\times5\times1/1,6$" means kernels of size $5\times5\times1$, a stride of 1, and 6 kernels).

Table 4.1 LeNet-5 Network Parameter Configuration

| Network Layer | Input Size | Kernel Size | Output Size | Trainable Parameters |
| :---- | :---- | :---- | :---- | :---- |
| Convolutional layer $C_1$ | $32\times32\times1$ | $5\times5\times1/1,6$ | $28\times28\times6$ | $(5\times5\times1+1)\times6$ |
| Downsampling layer $S_2$ | $28\times28\times6$ | $2\times2/2$ | $14\times14\times6$ | $(1+1)\times6$ $^*$ |
| Convolutional layer $C_3$ | $14\times14\times6$ | $5\times5\times6/1,16$ | $10\times10\times16$ | $1516^*$ |
| Downsampling layer $S_4$ | $10\times10\times16$ | $2\times2/2$ | $5\times5\times16$ | $(1+1)\times16$ |
| Convolutional layer $C_5$ $^*$ | $5\times5\times16$ | $5\times5\times16/1,120$ | $1\times1\times120$ | $(5\times5\times16+1)\times120$ |
| Fully connected layer $F_6$ | $1\times1\times120$ | $120\times84$ | $1\times1\times84$ | $(120+1)\times84$ |
| Output layer | $1\times1\times84$ | $84\times10$ | $1\times1\times10$ | $(84+1)\times10$ |

^* In LeNet, the downsampling operation is similar to pooling, but the sampled result is multiplied by a coefficient and an offset (bias) term is added, so the number of trainable parameters of the downsampling layer is $(1+1)\times6$ rather than zero.

^* The $C_3$ convolutional layer is not connected directly to all feature maps in $S_2$; it uses the sparse connection pattern shown in Figure 4.2. The 16 output feature maps are generated from combinations of three adjacent feature maps, four adjacent feature maps, four non-adjacent feature maps, and all six feature maps, so the number of trainable parameters is $6\times(25\times3+1)+6\times(25\times4+1)+3\times(25\times4+1)+1\times(25\times6+1)=1516$. The original paper gives two reasons for this sampling scheme: it keeps the number of connections within reasonable bounds (computing power at the time was weak), and forcing different combinations of feature maps pushes the resulting feature maps to learn different feature patterns.


Figure 4.2 Sparse connections between the feature maps of $S_2$ and $C_3$

^* The $C_5$ convolutional layer appears in Figure 4.1 as a fully connected layer. The original paper explains that a convolution operation is actually used here, but because the input has already shrunk to $5\times5$, the $5\times5$ convolution compresses the output to $1\times1$, so it looks very much like a full connection.
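
To make the layer sequence of Table 4.1 concrete, below is a minimal PyTorch sketch. It uses modern simplifications: plain average pooling in place of the trainable subsampling (coefficient plus bias) described in the notes above, a dense $C_3$ connection instead of the sparse one of Figure 4.2, and $tanh$ activations throughout.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal sketch of LeNet-5 following Table 4.1 (simplified)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 32x32x1 -> 28x28x6
            nn.AvgPool2d(kernel_size=2, stride=2),         # S2: -> 14x14x6 (plain pooling,
                                                           #     not trainable subsampling)
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: -> 10x10x16 (dense, not sparse)
            nn.AvgPool2d(kernel_size=2, stride=2),         # S4: -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: -> 1x1x120
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, num_classes),                    # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a single 32x32 grayscale image
x = torch.randn(1, 1, 32, 32)
print(LeNet5()(x).shape)  # torch.Size([1, 10])
```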

4.1.3 Model Features

  • The convolutional network uses a three-layer sequence combination: convolution, downsampling (pooling), and non-linear mapping (the most important feature of LeNet-5, which forms the basis of the current deep convolutional network)
  • Extract spatial features using convolution
  • Downsampling using the spatial average of feature maps
  • Non-linear mapping using $tanh$ or $sigmoid$
  • Multi-layer perceptron (MLP) as the final classifier
  • Sparse connection matrix between layers to avoid huge computational overhead

4.2 AlexNet

4.2.1 Introduction to the model

AlexNet, proposed by Alex Krizhevsky et al., was the first deep convolutional neural network applied to large-scale image classification; it won first place in the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) classification competition with a 15.3% top-5 test error rate$^{[2]}$. AlexNet used GPUs instead of CPUs for training, making a more complex model structure feasible within an acceptable time. Its success demonstrated the effectiveness of deep convolutional neural networks on complex models, made CNNs popular in computer vision, and directly or indirectly triggered the wave of deep learning.

4.2.2 Model structure

Figure 4.3 AlexNet network structure

As shown in Figure 4.3, not counting the downsampling (pooling) layers and Local Response Normalization (LRN) operations, AlexNet consists of 8 layers: the first 5 are convolutional layers and the remaining 3 are fully connected layers. The network structure is split into an upper and a lower stream, corresponding to the processing pipelines of two GPUs; except for a few layers in the middle (the $C_3$ convolutional layer and the $F_{6-8}$ fully connected layers involve cross-GPU interaction), each GPU computes its results independently. The output of the last fully connected layer is fed into a softmax, producing the probability values for the 1000 image classification labels. Setting aside the two-GPU parallel design, AlexNet's structure is very similar to LeNet's. The parameter configuration of the network is shown in Table 4.2.

Table 4.2 AlexNet Network Parameter Configuration

| Network Layer | Input Size | Kernel Size | Output Size | Trainable Parameters |
| :---- | :---- | :---- | :---- | :---- |
| Convolutional layer $C_1$ $^*$ | $224\times224\times3$ | $11\times11\times3/4,48(\times2_{GPU})$ | $55\times55\times48(\times2_{GPU})$ | $(11\times11\times3+1)\times48\times2$ |
| Downsampling layer $S_{max}$ $^*$ | $55\times55\times48(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $27\times27\times48(\times2_{GPU})$ | $0$ |
| Convolutional layer $C_2$ | $27\times27\times48(\times2_{GPU})$ | $5\times5\times48/1,128(\times2_{GPU})$ | $27\times27\times128(\times2_{GPU})$ | $(5\times5\times48+1)\times128\times2$ |
| Downsampling layer $S_{max}$ | $27\times27\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU})$ | $0$ |
| Convolutional layer $C_3$ $^*$ | $13\times13\times128\times2_{GPU}$ | $3\times3\times256/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU})$ | $(3\times3\times256+1)\times192\times2$ |
| Convolutional layer $C_4$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU})$ | $(3\times3\times192+1)\times192\times2$ |
| Convolutional layer $C_5$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,128(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU})$ | $(3\times3\times192+1)\times128\times2$ |
| Downsampling layer $S_{max}$ | $13\times13\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $6\times6\times128(\times2_{GPU})$ | $0$ |
| Fully connected layer $F_6$ $^*$ | $6\times6\times128\times2_{GPU}$ | $9216\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(9216+1)\times2048\times2$ |
| Fully connected layer $F_7$ | $1\times1\times2048\times2_{GPU}$ | $4096\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(4096+1)\times2048\times2$ |
| Fully connected layer $F_8$ | $1\times1\times2048\times2_{GPU}$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000\times2$ |

The convolutional layer $C_1$ takes $224\times224\times3$ image data as input; each of the two GPUs convolves it with $11\times11\times3$ kernels at a stride of 4, producing two independent outputs of size $55\times55\times48$.

The downsampling layer $S_{max}$ is actually max pooling nested between the convolutions, but it is listed separately to distinguish it from convolutional layers that are not followed by max pooling. In the $C_{1-2}$ convolutional layers there is also an LRN operation after the ReLU activation (and before the pooling operation) that normalizes adjacent feature points.

The input of the convolutional layer $C_3$ differs from that of the other convolutional layers: $13\times13\times128\times2_{GPU}$ means that the outputs of the previous layer on both GPUs are gathered together as the input, so during the convolution operation the channel dimension of the convolution kernels is 256.

The input of the fully connected layers $F_{6-8}$ is handled in the same way as $C_3$: the outputs of the two GPU streams are combined as the input.
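
The two-GPU split was a hardware workaround, and the network is usually reproduced today as a single stream. Below is a minimal PyTorch sketch with the per-GPU channel counts of Table 4.2 merged (96/256/384/384/256 convolutional channels, 4096-unit fully connected layers); the padding values are assumptions chosen so the spatial sizes work out for a $224\times224$ input, and ReLU, LRN, overlapping max pooling, and dropout are placed as described above.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Single-stream sketch of AlexNet (the two GPU streams merged)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),            # overlapping pooling: z=3 > s=2
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),            # -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                     # fed to softmax for 1000 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 3, 224, 224)
print(AlexNetSketch()(x).shape)  # torch.Size([1, 1000])
```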

4.2.3 Model Features

  • All convolutional layers use ReLU as a nonlinear mapping function to make the model converge faster
  • Training of models on multiple GPUs not only improves the training speed of the model, but also increases the scale of data usage.
  • Normalizing local features with LRN after the ReLU activation, which effectively reduces the error rate
  • Overlapping max pooling, i.e., the pooling window $z$ is larger than the stride $s$ ($z>s$; e.g., the kernel size of $S_{max}$ is $3\times3/2$), which avoids the blurring effect of average pooling
  • Use random dropout to selectively ignore individual neurons in training to avoid overfitting of the model

4.3 ZFNet

4.3.1 Introduction to the model

ZFNet is a large convolutional network proposed by Matthew D. Zeiler and Rob Fergus, built on AlexNet. It won the 2013 ILSVRC image classification competition with an error rate of 11.19% (strictly speaking, the original ZFNet team was not the real champion: the original ZFNet ranked 8th with a 13.51% error rate, and the actual winner was the Clarifai team; however, the CEO of the startup Clarifai is Zeiler, and Clarifai's changes to ZFNet were relatively small, so ZFNet is generally regarded as the winner)$^{[3-4]}$. ZFNet is essentially a fine-tuned AlexNet, and by visualizing the output features of each layer with deconvolution it further explains why convolution operations work well in large networks.

4.3.2 Model structure

Figure 4.4 ZFNet network structure diagram (original structure diagram and AlexNet style structure diagram)

As shown in Figure 4.4, ZFNet is similar to AlexNet: a convolutional neural network composed of 8 layers, with 5 convolutional layers and 3 fully connected layers. The biggest difference between the two architectures is that ZFNet's first layer replaces AlexNet's first-layer $11\times11\times3/4$ convolution kernel with a $7\times7\times3/2$ kernel. As Figure 4.5 shows, the feature maps output by ZFNet's first layer contain more mid-frequency information than AlexNet's, whereas AlexNet's first-layer feature maps are mostly low-frequency or high-frequency; this lack of mid-frequency features makes the features of the subsequent layers, shown in Figure 4.5(c), not detailed enough, and the root cause is that the convolution kernel and stride AlexNet uses in its first layer are too large.

Figure 4.5 (a) Feature maps output by the first layer of ZFNet (b) Feature maps output by the first layer of AlexNet (c) Feature maps output by the second layer of AlexNet (d) Feature maps output by the second layer of ZFNet

Table 4.3 ZFNet Network Parameter Configuration

| Network Layer | Input Size | Kernel Size | Output Size | Trainable Parameters |
| :---- | :---- | :---- | :---- | :---- |
| Convolutional layer $C_1$ $^*$ | $224\times224\times3$ | $7\times7\times3/2,96$ | $110\times110\times96$ | $(7\times7\times3+1)\times96$ |
| Downsampling layer $S_{max}$ | $110\times110\times96$ | $3\times3/2$ | $55\times55\times96$ | $0$ |
| Convolutional layer $C_2$ $^*$ | $55\times55\times96$ | $5\times5\times96/2,256$ | $26\times26\times256$ | $(5\times5\times96+1)\times256$ |
| Downsampling layer $S_{max}$ | $26\times26\times256$ | $3\times3/2$ | $13\times13\times256$ | $0$ |
| Convolutional layer $C_3$ | $13\times13\times256$ | $3\times3\times256/1,384$ | $13\times13\times384$ | $(3\times3\times256+1)\times384$ |
| Convolutional layer $C_4$ | $13\times13\times384$ | $3\times3\times384/1,384$ | $13\times13\times384$ | $(3\times3\times384+1)\times384$ |
| Convolutional layer $C_5$ | $13\times13\times384$ | $3\times3\times384/1,256$ | $13\times13\times256$ | $(3\times3\times384+1)\times256$ |
| Downsampling layer $S_{max}$ | $13\times13\times256$ | $3\times3/2$ | $6\times6\times256$ | $0$ |
| Fully connected layer $F_6$ | $6\times6\times256$ | $9216\times4096$ | $1\times1\times4096$ | $(9216+1)\times4096$ |
| Fully connected layer $F_7$ | $1\times1\times4096$ | $4096\times4096$ | $1\times1\times4096$ | $(4096+1)\times4096$ |
| Fully connected layer $F_8$ | $1\times1\times4096$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000$ |

Unlike $C_1$ in AlexNet, the convolutional layer $C_1$ uses a $7\times7\times3/2$ kernel instead of $11\times11\times3/4$, so the output of the first convolutional layer retains more mid-frequency features, giving the subsequent layers a more diverse set of features to choose from and helping them capture finer details.

The convolutional layer $C_2$ uses kernels with a stride of 2, which differs from the stride of the $C_2$ kernels in AlexNet, so the output dimensions are different.
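
To make the first-layer change concrete, below is a sketch of the ZFNet convolutional stack from Table 4.3 in PyTorch. The padding values are assumptions chosen so that the spatial sizes match the table; they are not given in the table itself.

```python
import torch
import torch.nn as nn

# Sketch of the ZFNet convolutional stack from Table 4.3 (padding values assumed).
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1), nn.ReLU(),  # C1: 224 -> 110
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                 # 110 -> 55
    nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),           # C2: 55 -> 26
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                 # 26 -> 13
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),         # C3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),         # C4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),         # C5
    nn.MaxPool2d(kernel_size=3, stride=2),                            # 13 -> 6
)

x = torch.randn(1, 3, 224, 224)
print(features(x).shape)  # torch.Size([1, 256, 6, 6])
```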

4.3.3 Model Features

ZFNet is almost identical to AlexNet in structure. Although this subsection is titled "model features", the items below are more accurately the contributions of the visualization techniques in the original ZFNet paper.

  • Visualization techniques reveal individual feature maps for each layer in the excitation model.
  • Visualization techniques allow observation of the evolution of features during the training phase and the diagnosis of potential problems with the model.
  • Visualization technology uses a multi-layer deconvolution network that returns to the input pixel space by feature activation.
  • Visualization techniques perform a sensitivity analysis of the classifier output by occluding parts of the input image, revealing which parts of the image are important for classification.
  • Visualization techniques provide a non-parametric view of invariance, showing which patches from the training set activate a given feature map: not only cropped input patches, but also top-down projections that expose the structure within each patch that stimulates that feature map.
  • Visualization techniques rely on deconvolution operations, the inverse of convolution operations, to map features onto pixels.

4.4 Network in Network

4.4.1 Introduction to the model

Network In Network (NIN) was proposed by Min Lin et al.; it achieved the best performance of its time on the CIFAR-10 and CIFAR-100 classification tasks, and its network structure is a stack of three multi-layer perceptron convolutional layers$^{[5]}$. NIN re-examines the design of the convolution kernel in convolutional neural networks from a new perspective, replacing the purely linear mapping of plain convolution by introducing a sub-network structure. This form of network structure inspired more complex convolutional network designs; the Inception structure of GoogLeNet introduced in the next section derives from this idea.

4.4.2 Model Structure


Figure 4.6 NIN network structure

NIN is composed of three multi-layer perceptron convolutional layers (MLPConv layers). Each MLPConv layer consists of several local fully connected layers and nonlinear activation functions, replacing the linear convolution kernels used in traditional convolutional layers. During inference, the multi-layer perceptron computes features over local patches of the input feature map, and the weights are shared across the local patches of every window, exactly as in a convolution operation; the biggest difference is that the multi-layer perceptron applies a nonlinear mapping to the local features, whereas traditional convolution is linear. NIN's network parameter configuration is shown in Table 4.4 (the original paper does not give the network parameters; the parameters in the table were worked out by the editors from the network structure diagram and the CIFAR-100 dataset, using $3\times3$ convolutions as an example).

Table 4.4 NIN network parameter configuration (combined with the original paper NIN structure and CIFAR-100 data)

| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
| :---- | :---- | :---- | :---- | :---- |
| Local fully connected layer $L_{11}$ $^*$ | $32\times32\times3$ | $(3\times3)\times16/1$ | $30\times30\times16$ | $(3\times3\times3+1)\times16$ |
| Fully connected layer $L_{12}$ $^*$ | $30\times30\times16$ | $16\times16$ | $30\times30\times16$ | $(16+1)\times16$ |
| Local fully connected layer $L_{21}$ | $30\times30\times16$ | $(3\times3)\times64/1$ | $28\times28\times64$ | $(3\times3\times16+1)\times64$ |
| Fully connected layer $L_{22}$ | $28\times28\times64$ | $64\times64$ | $28\times28\times64$ | $(64+1)\times64$ |
| Local fully connected layer $L_{31}$ | $28\times28\times64$ | $(3\times3)\times100/1$ | $26\times26\times100$ | $(3\times3\times64+1)\times100$ |
| Fully connected layer $L_{32}$ | $26\times26\times100$ | $100\times100$ | $26\times26\times100$ | $(100+1)\times100$ |
| Global average pooling $GAP$ $^*$ | $26\times26\times100$ | $26\times26\times100/1$ | $1\times1\times100$ | $0$ |

The local fully connected layer $L_{11}$ is in effect a windowed fully connected operation over the original input image, so the output feature size after windowing is $30\times30$ ($\frac{32-3_k+1}{1_{stride}}=30$).

The fully connected layer $L_{12}$ is a fully connected operation that immediately follows $L_{11}$; its input is the activated local response feature from the windowing, so it only needs to connect the nodes of $L_{11}$ and $L_{12}$. Each local fully connected layer together with the fully connected layer that immediately follows it constitutes one multi-layer perceptron convolutional layer (MLPConv), replacing a convolution operation.

The global average pooling layer $GAP$ (Global Average Pooling) performs a global average pooling operation on each feature map output by $L_{32}$, directly producing the final number of categories, which effectively reduces the number of parameters.
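
In practice an MLPConv layer is implemented as an ordinary convolution followed by $1\times1$ convolutions, since a $1\times1$ convolution is exactly a fully connected layer shared across all positions. Below is a minimal PyTorch sketch following the layer sizes of Table 4.4 (which are themselves the editors' example values, not values from the paper); the ReLU activations are an assumption of this sketch.

```python
import torch
import torch.nn as nn

def mlpconv(in_ch, out_ch, kernel_size):
    """One MLPConv block: a kxk convolution over each local window followed by a
    1x1 convolution, i.e. a small fully connected net shared across positions."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size), nn.ReLU(),     # local "fully connected" layer
        nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(),  # per-position fully connected layer
    )

class NINSketch(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.blocks = nn.Sequential(
            mlpconv(3, 16, 3),            # L11/L12: 32x32x3  -> 30x30x16
            mlpconv(16, 64, 3),           # L21/L22: 30x30x16 -> 28x28x64
            mlpconv(64, num_classes, 3),  # L31/L32: 28x28x64 -> 26x26x100
        )
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling: one value per class map

    def forward(self, x):
        return self.gap(self.blocks(x)).flatten(1)

x = torch.randn(1, 3, 32, 32)
print(NINSketch()(x).shape)  # torch.Size([1, 100])
```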

4.4.3 Model Features

  • Using a multi-layer perceptron structure in place of the convolutional filtering operation not only effectively alleviates the explosion of parameters caused by a large number of convolution kernels, but also improves the model's ability to abstract features by introducing nonlinear mappings.
  • Using global average pooling instead of the last fully connected layer effectively reduces the number of parameters (it has no trainable parameters); the pooling also uses the information of the entire feature map, making it more robust to spatial transformations, and the resulting outputs can be used directly as confidence scores for the corresponding categories.

4.5 VGGNet

4.5.1 Introduction to the model

VGGNet is a deep convolutional network structure proposed by the Visual Geometry Group (VGG) of Oxford University. It took second place in the 2014 ILSVRC classification task with a 7.32% error rate (the champion GoogLeNet scored 6.65%) and first place in the localization task with a 25.32% error rate (GoogLeNet's was 26.44%)$^{[6]}$; the network name VGGNet is taken from the abbreviation of the group's name. VGGNet was the first to push the image classification error rate below 10%, and the idea of using $3\times3$ convolution kernels throughout the network became the basis of many later models. The paper was published at the 2015 International Conference on Learning Representations (ICLR) and has been cited more than 14,000 times since.

4.5.2 Model structure

Figure 4.7 VGG16 network structure

In the original paper, VGGNet comprises six versions: VGG11, VGG11-LRN, VGG13, VGG16-1, VGG16-3, and VGG19, where the suffix number indicates the number of weight layers (VGG11-LRN denotes the VGG11 variant that uses LRN in the first layer; VGG16-1 indicates that the last convolutional layer of each of the last three convolutional blocks uses $1\times1$ kernels, and the corresponding VGG16-3 indicates $3\times3$ kernels). The VGG16 introduced in this section is VGG16-3. The VGG16 in Figure 4.7 embodies the core idea of VGGNet: using combinations of $3\times3$ convolutions in place of large convolutions (two stacked $3\times3$ convolutions have the same receptive field as one $5\times5$ convolution). The network parameter settings are shown in Table 4.5.

Table 4.5 VGG16 network parameter configuration

| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
| :---- | :---- | :---- | :---- | :---- |
| Convolutional layer $C_{11}$ | $224\times224\times3$ | $3\times3\times64/1$ | $224\times224\times64$ | $(3\times3\times3+1)\times64$ |
| Convolutional layer $C_{12}$ | $224\times224\times64$ | $3\times3\times64/1$ | $224\times224\times64$ | $(3\times3\times64+1)\times64$ |
| Downsampling layer $S_{max1}$ | $224\times224\times64$ | $2\times2/2$ | $112\times112\times64$ | $0$ |
| Convolutional layer $C_{21}$ | $112\times112\times64$ | $3\times3\times128/1$ | $112\times112\times128$ | $(3\times3\times64+1)\times128$ |
| Convolutional layer $C_{22}$ | $112\times112\times128$ | $3\times3\times128/1$ | $112\times112\times128$ | $(3\times3\times128+1)\times128$ |
| Downsampling layer $S_{max2}$ | $112\times112\times128$ | $2\times2/2$ | $56\times56\times128$ | $0$ |
| Convolutional layer $C_{31}$ | $56\times56\times128$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times128+1)\times256$ |
| Convolutional layer $C_{32}$ | $56\times56\times256$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times256+1)\times256$ |
| Convolutional layer $C_{33}$ | $56\times56\times256$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times256+1)\times256$ |
| Downsampling layer $S_{max3}$ | $56\times56\times256$ | $2\times2/2$ | $28\times28\times256$ | $0$ |
| Convolutional layer $C_{41}$ | $28\times28\times256$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times256+1)\times512$ |
| Convolutional layer $C_{42}$ | $28\times28\times512$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times512+1)\times512$ |
| Convolutional layer $C_{43}$ | $28\times28\times512$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times512+1)\times512$ |
| Downsampling layer $S_{max4}$ | $28\times28\times512$ | $2\times2/2$ | $14\times14\times512$ | $0$ |
| Convolutional layer $C_{51}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
| Convolutional layer $C_{52}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
| Convolutional layer $C_{53}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
| Downsampling layer $S_{max5}$ | $14\times14\times512$ | $2\times2/2$ | $7\times7\times512$ | $0$ |
| Fully connected layer $FC_{1}$ | $7\times7\times512$ | $(7\times7\times512)\times4096$ | $1\times4096$ | $(7\times7\times512+1)\times4096$ |
| Fully connected layer $FC_{2}$ | $1\times4096$ | $4096\times4096$ | $1\times4096$ | $(4096+1)\times4096$ |
| Fully connected layer $FC_{3}$ | $1\times4096$ | $4096\times1000$ | $1\times1000$ | $(4096+1)\times1000$ |
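
As a concrete reference, below is a minimal PyTorch sketch of VGG16-3 assembled from repeated $3\times3$ convolution blocks following Table 4.5. The dropout in the fully connected layers follows the original paper's training setup and is not listed in the table.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """A VGG block: num_convs 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1), nn.ReLU()]
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

class VGG16Sketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(3, 64, 2),     # C11-C12 + S_max1: 224 -> 112
            vgg_block(64, 128, 2),   # C21-C22 + S_max2: 112 -> 56
            vgg_block(128, 256, 3),  # C31-C33 + S_max3: 56  -> 28
            vgg_block(256, 512, 3),  # C41-C43 + S_max4: 28  -> 14
            vgg_block(512, 512, 3),  # C51-C53 + S_max5: 14  -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 512, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 3, 224, 224)
print(VGG16Sketch()(x).shape)  # torch.Size([1, 1000])
```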

4.5.3 Model Features

  • The entire network uses convolution kernels of the same size ($3\times3$) and max pooling of the same size ($2\times2$).
  • The role of the $1\times1$ convolution is essentially a linear transformation: the number of input channels and output channels stays the same, and no dimensionality reduction occurs.
  • Two stacked $3\times3$ convolutional layers have a $5\times5$ receptive field, equivalent to one $5\times5$ convolutional layer; similarly, three stacked $3\times3$ convolutions are equivalent to one $7\times7$ convolutional layer. This kind of stacking uses fewer parameters, and the additional activation functions give the network a stronger ability to learn features (see the sketch after this list).
  • VGGNet uses a trick during training: it first trains the shallow, simple VGG11, then reuses VGG11's weights to initialize VGG13, and repeats this process of training and initialization up to VGG19, which makes training converge faster.
  • Multi-scale transformations are used to augment the original data during training, making the model harder to overfit.
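
A quick way to check the parameter claim in the third point above: for a layer with $C$ input and $C$ output channels, a $k\times k$ convolution has $k^2C^2$ weights (ignoring biases), so stacked $3\times3$ layers are cheaper than the single large kernel they replace. A small sketch:

```python
# Weight counts (biases ignored) for a layer with C input and C output channels:
# a kxk convolution costs k*k*C*C weights.
def conv_weights(k, channels):
    return k * k * channels * channels

C = 512  # e.g. a late VGG16 stage
print(2 * conv_weights(3, C), "vs", conv_weights(5, C))  # 4718592 vs 6553600
print(3 * conv_weights(3, C), "vs", conv_weights(7, C))  # 7077888 vs 12845056
```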

4.6 GoogLeNet

4.6.1 Introduction to the model

As the winner of the 2014 ILSVRC classification task, GoogLeNet beat VGGNet and other models with an error rate of 6.65%; compared with the previous two winners, ZFNet and AlexNet, the classification accuracy is greatly improved. As the name GoogLeNet suggests, it is a network structure designed by Google engineers, and the name pays tribute to LeNet$^{[11]}$. The core of GoogLeNet is its internal sub-network structure, Inception, which was inspired by NIN and has gone through four iterations ($Inception_{v1-4}$).


Figure 4.8 Inception performance comparison chart

4.6.2 Model Structure


Figure 4.9 GoogLeNet network structure
As shown in Figure 4.9, in addition to the depth of earlier convolutional neural network structures, GoogLeNet also extends the width of the network. The entire network is composed of many block sub-networks; this sub-network is the Inception structure. Figure 4.10 shows the four versions of Inception: $Inception_{v1}$ uses convolution kernels of different sizes in the same layer and merges the convolution results; $Inception_{v2}$ combines stacks of different convolution kernels and merges the results; $Inception_{v3}$ is an attempt to go deeper on the basis of $v_2$; the $Inception_{v4}$ structure is more complex than the previous versions, nesting sub-networks within sub-networks.





Figure 4.10 $Inception_{v1-4}$ structure diagrams

Table 4.6 $Inception_{v1}$ network parameter configuration in GoogLeNet

| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
| :---- | :---- | :---- | :---- | :---- |
| Convolutional layer $C_{11}$ | $H\times W\times C_1$ | $1\times1\times C_2/2$ | $\frac{H}{2}\times\frac{W}{2}\times C_2$ | $(1\times1\times C_1+1)\times C_2$ |
| Convolutional layer $C_{21}$ | $H\times W\times C_2$ | $1\times1\times C_2/2$ | $\frac{H}{2}\times\frac{W}{2}\times C_2$ | $(1\times1\times C_2+1)\times C_2$ |
| Convolutional layer $C_{22}$ | $H\times W\times C_2$ | $3\times3\times C_2/1$ | $H\times W\times C_2$ | $(3\times3\times C_2+1)\times C_2$ |
| Convolutional layer $C_{31}$ | $H\times W\times C_1$ | $1\times1\times C_2/2$ | $\frac{H}{2}\times\frac{W}{2}\times C_2$ | $(1\times1\times C_1+1)\times C_2$ |
| Convolutional layer $C_{32}$ | $H\times W\times C_2$ | $5\times5\times C_2/1$ | $H\times W\times C_2$ | $(5\times5\times C_2+1)\times C_2$ |
| Downsampling layer $S_{41}$ | $H\times W\times C_1$ | $3\times3/2$ | $\frac{H}{2}\times\frac{W}{2}\times C_2$ | $0$ |
| Convolutional layer $C_{42}$ | $\frac{H}{2}\times\frac{W}{2}\times C_2$ | $1\times1\times C_2/1$ | $\frac{H}{2}\times\frac{W}{2}\times C_2$ | $(1\times1\times C_2+1)\times C_2$ |
| Merge layer $M$ | $\frac{H}{2}\times\frac{W}{2}\times C_2(\times4)$ | Concatenation | $\frac{H}{2}\times\frac{W}{2}\times(C_2\times4)$ | $0$ |

4.6.3 Model Features

  • The use of convolution kernels of different sizes means different sizes of receptive fields, and the final splicing means the fusion of different scale features;
  • The convolution kernel sizes are 1, 3, and 5 mainly for convenient alignment: with the convolution stride set to 1, setting pad = 0, 1, 2 respectively produces features of the same spatial dimensions after convolution, so these features can be concatenated directly;
  • The deeper the network, the more abstract the features and the larger the receptive field involved in each feature, so the proportion of 3x3 and 5x5 convolutions increases with the number of layers. However, 5x5 convolution kernels still entail a huge amount of computation; to address this, the paper borrows from NIN and uses 1x1 convolution kernels for dimensionality reduction (see the module sketch after this list).
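
Below is a minimal PyTorch sketch of an $Inception_{v1}$ module with the $1\times1$ dimensionality-reduction branches described above; the branch channel counts passed in the example are illustrative choices, not values prescribed by Table 4.6.

```python
import torch
import torch.nn as nn

class InceptionV1Block(nn.Module):
    """Sketch of an Inception v1 module: four parallel branches whose outputs
    share the same spatial size and are concatenated along the channel axis."""
    def __init__(self, in_ch, c1x1, c3x3_reduce, c3x3, c5x5_reduce, c5x5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1x1, 1), nn.ReLU())
        self.branch2 = nn.Sequential(                       # 1x1 reduction before the 3x3
            nn.Conv2d(in_ch, c3x3_reduce, 1), nn.ReLU(),
            nn.Conv2d(c3x3_reduce, c3x3, 3, padding=1), nn.ReLU())
        self.branch3 = nn.Sequential(                       # 1x1 reduction before the 5x5
            nn.Conv2d(in_ch, c5x5_reduce, 1), nn.ReLU(),
            nn.Conv2d(c5x5_reduce, c5x5, 5, padding=2), nn.ReLU())
        self.branch4 = nn.Sequential(                       # pooling branch with 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

# Illustrative channel counts (not taken from Table 4.6)
block = InceptionV1Block(192, c1x1=64, c3x3_reduce=96, c3x3=128,
                         c5x5_reduce=16, c5x5=32, pool_proj=32)
x = torch.randn(1, 192, 28, 28)
print(block(x).shape)  # torch.Size([1, 256, 28, 28])
```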

4.7 Why are current CNN models adjusted from GoogLeNet, VGGNet, or AlexNet?

  • Evaluation comparison: to make your results more convincing when you publish them, you compare against a standard baseline and improve on it; common tasks such as detection and segmentation therefore build on VGG or ResNet-101 as the base network.
  • Limited time and energy: under research and work pressure, time and energy only allow exploration within a limited range.
  • Model innovation is difficult: improving the basic model requires a great deal of experimentation, accumulated experimental experience, and strong inspiration, so the input-to-output ratio is likely to be small.
  • Resource limitations: Creating a new model requires a lot of time and computing resources, often not feasible in schools and small business teams.
  • In actual application scenarios, there are actually a large number of non-standard model configurations.

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, november 1998.

[2] A. Krizhevsky, I. Sutskever and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25. Curran Associates, Inc. 1097–1105.

[3] LSVRC-2013. http://www.image-net.org/challenges/LSVRC/2013/results.php

[4] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. European Conference on Computer Vision.

[5] M. Lin, Q. Chen, and S. Yan. Network in network. Computing Research Repository, abs/1312.4400, 2013.

[6] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations, 2015.

[7] Bharath Raj. [A Simple Guide to the Versions of the Inception Network](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202), 2018.

[8] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016.

[9] Sik-Ho Tsang. review-inception-v4-evolved-from-googlenet-merged-with-resnet-idea-image-classification, 2018.

[10] Zbigniew Wojna, Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens. Rethinking the Inception Architecture for Computer Vision, 2015.

[11] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Going deeper with convolutions, 2014.