[Articles] ImageNet Classification with Deep Convolutional Neural Networks

About this Article

Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Journal: Advances in Neural Information Processing Systems
Year: 2012
Citation: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems 25 (2012).

Accomplishments

Trained a large, deep Convolutional Neural Network (CNN) to classify ImageNet images into 1000 classes.
Devised a publicly available GPU implementation of 2D convolution.

Key Points

1. CNN Architecture (AlexNet)

The architecture consists of five convolutional layers and three fully-connected layers.

Layer	1st	2nd	3rd	4th	5th	Fully-Connected Layers (3)
Input Size	$224\times224\times3$	$27\times27$	$13\times13$	$13\times13$	$13\times13$	-
Kernel Size	$11\times11\times3$	$5\times5\times48$	$3\times3\times256$	$3\times3\times192$	$3\times3\times192$	-
Kernel Num	96	256	384	384	256	4096 neurons in each of the first two
Stride	4	-	-	-	-	-
Connection	-	only same GPU	all maps of 2nd layer	only same GPU	only same GPU	all neurons in the previous layer
LRN	Yes	Yes	No	No	No	No
Max Pooling	Yes	Yes	No	No	Yes	No
ReLU	Yes	Yes	Yes	Yes	Yes	Yes
Others	-	-	-	-	-	Dropout is applied to the first two fully-connected layers. The last layer is a 1000-way softmax.
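The spatial sizes in the table can be sanity-checked with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below is only a check under stated assumptions: the padding values and the stride of 1 for layers 2 to 5 come from common reimplementations, not from the table, and `out_size` and `trace_sizes` are hypothetical helper names.

```python
def out_size(n, k, s=1, p=0):
    """Spatial output size of a convolution or pooling window."""
    return (n + 2 * p - k) // s + 1

# (name, kernel size, stride, padding); padding and unit strides are assumed.
LAYERS = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

def trace_sizes(n=224):
    """Follow a square input of side n through the layer list."""
    sizes = {}
    for name, k, s, p in LAYERS:
        n = out_size(n, k, s, p)
        sizes[name] = n
    return sizes

sizes = trace_sizes()
for name, n in sizes.items():
    print(f"{name}: {n}x{n}")
```

Under these assumptions the inputs to the 2nd and 3rd layers come out as 27×27 and 13×13, which agrees with the table.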

2. Key Features of the Architecture

(In order of importance as stated in the paper)

ReLU Nonlinearity
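- Instead of saturating sigmoid or tanh functions, the authors used the non-saturating ReLU ($f(x)=\max(0,x)$).
- Training is several times faster than with tanh: in the paper, a four-layer CNN with ReLUs reached 25% training error on CIFAR-10 six times faster than the same network with tanh units.
- ReLUs do not need input normalization to prevent saturation, but AlexNet still used LRN because it improves generalization.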
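To illustrate what "non-saturating" means here, the following sketch compares the derivative of ReLU with that of tanh for a few inputs. It is an illustration only; the function names and sample inputs are made up for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # The derivative is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(float)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, which shrinks toward 0 as |x| grows.
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.5, 5.0])
print(relu(x))
print(relu_grad(x))
print(tanh_grad(x))
```

For large positive inputs the tanh derivative is close to zero while the ReLU derivative stays at 1, so gradients keep flowing through active ReLU units.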
Training on Multiple GPUs
- The authors spread the network across two GPUs (GPU parallelization).
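- The GPUs communicate only in certain layers, which lets the amount of communication be tuned until it is an acceptable fraction of the computation.
- Compared with a network with half as many kernels in each convolutional layer trained on one GPU, this reduces the top-1 and top-5 error rates by 1.7% and 1.2%, respectively.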
Local Response Normalization (LRN)
- Used to normalize the output of ReLU neurons, as they can become too large.
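- It creates competition for large activities among neurons at the same spatial position in adjacent kernel maps, which improves generalization (the paper reports top-1 and top-5 error reductions of 1.4% and 1.2%).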
- \[b_{x,y}^{i}=a_{x,y}^{i}\Big/\Big(k+\alpha\sum_{j=\max(0,i-n/2)}^{\min(N-1,i+n/2)}(a_{x,y}^{j})^{2}\Big)^{\beta}\]
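The formula can be read as a loop over channels: each activation is divided by a term built from the squared activations of its n nearest kernel maps at the same position. Below is a minimal NumPy sketch, assuming a channels-first array of shape (N, H, W) and the hyperparameters reported in the paper (k=2, n=5, α=10⁻⁴, β=0.75). The function name and the random test input are hypothetical.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Apply LRN across channels; `a` has shape (N, H, W)."""
    num_channels = a.shape[0]
    squared = a ** 2
    b = np.empty_like(a)
    half = n // 2
    for i in range(num_channels):
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        # Sum of squares over the neighbouring channels at each (x, y).
        denom = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        b[i] = a[i] / denom
    return b

rng = np.random.default_rng(0)
activations = rng.random((8, 4, 4)) * 10.0  # made-up non-negative ReLU outputs
normalized = local_response_norm(activations)
print(normalized.shape)
```

Because the sum runs over channels rather than spatial positions, the output keeps the input's shape.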
Overlapping Pooling
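- Unlike traditional non-overlapping pooling, the authors used overlapping pooling, where the stride $s$ is smaller than the pooling window size $z$ (s=2, z=3).
- Compared with the non-overlapping scheme (s=2, z=2), this reduces the top-1 and top-5 error rates by 0.4% and 0.3% and makes the model slightly harder to overfit.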
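The difference between the two pooling schemes is easiest to see on a small feature map. The sketch below is a plain-loop max pooling without padding; `max_pool2d` and the 7×7 input are hypothetical, chosen only for illustration.

```python
import numpy as np

def max_pool2d(x, z=3, s=2):
    """Max-pool a 2D map with window size z and stride s (no padding)."""
    h, w = x.shape
    out_h = (h - z) // s + 1
    out_w = (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

feature_map = np.arange(49).reshape(7, 7)
overlapping = max_pool2d(feature_map, z=3, s=2)      # adjacent windows share a row/column
non_overlapping = max_pool2d(feature_map, z=2, s=2)  # windows are disjoint
print(overlapping.shape, non_overlapping.shape)
```

On this input both settings produce a 3×3 output, so the overlap changes which values each window sees rather than the output size.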

3. Reducing Overfitting

Data Augmentation
- The authors used two forms of data augmentation.
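- Generating image translations and horizontal reflections: Random $224\times224$ patches and their horizontal reflections were extracted from the $256\times256$ images for training. At test time, predictions were averaged over ten patches: the four corner patches, the center patch, and their horizontal reflections.
- Altering the intensities of the RGB channels: PCA was performed on the set of RGB pixel values across the training set, and multiples of the principal components were added to each training image. Here $p_{i}$ and $\lambda_{i}$ are the eigenvectors and eigenvalues of the $3\times3$ covariance matrix of RGB pixel values, and each $\alpha_{i}$ is drawn from a Gaussian with mean 0 and standard deviation 0.1. This reflects the fact that object identity is invariant to changes in the intensity and color of the illumination. The quantity added to each RGB pixel is: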
- \[[p_{1},p_{2},p_{3}][\alpha_{1}\lambda_{1},\alpha_{2}\lambda_{2},\alpha_{3}\lambda_{3}]^{T}\]
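Both augmentations can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the authors' GPU pipeline: the covariance here is computed from a single random stand-in image rather than the whole training set, and the function names are hypothetical. The Gaussian standard deviation of 0.1 is the value given in the paper.

```python
import numpy as np

def random_crop_flip(image, rng, size=224):
    """Take a random size x size patch and mirror it half of the time."""
    h, w, _ = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    patch = image[top:top + size, left:left + size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

def pca_color_shift(pixels, rng, sigma=0.1):
    """Return the RGB offset [p1, p2, p3][a1*l1, a2*l2, a3*l3]^T."""
    # pixels: (num_pixels, 3) array of RGB values.
    cov = np.cov(pixels, rowvar=False)       # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are the p_i
    alphas = rng.normal(0.0, sigma, size=3)  # one draw per principal component
    return eigvecs @ (alphas * eigvals)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))            # stand-in for a training image
patch = random_crop_flip(image, rng)
shift = pca_color_shift(image.reshape(-1, 3), rng)
augmented = patch + shift                    # the same offset is added to every pixel
print(patch.shape, shift.shape)
```

The offset is a single RGB vector per image, so broadcasting adds it to all pixels of the patch at once.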
Dropout
- Reduces complex co-adaptations of neurons by setting the output of each hidden neuron to zero with a probability of 0.5 during training, so that a neuron cannot rely on the presence of particular other neurons.
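A minimal sketch of this scheme follows, including the test-time rule from the paper, where all neurons are kept and their outputs are multiplied by 0.5. Note that this is not the "inverted" dropout variant common in current libraries, which rescales during training instead. The function name and the all-ones input are hypothetical.

```python
import numpy as np

def dropout(h, rng, p_drop=0.5, training=True):
    """Dropout as described in the paper (not the 'inverted' variant)."""
    if training:
        # Zero each activation independently with probability p_drop.
        mask = rng.random(h.shape) >= p_drop
        return h * mask
    # At test time keep every neuron and scale by the keep probability.
    return h * (1.0 - p_drop)

rng = np.random.default_rng(0)
hidden = np.ones(4096)
train_out = dropout(hidden, rng, training=True)
test_out = dropout(hidden, rng, training=False)
print(train_out.mean(), test_out.mean())
```

With a drop probability of 0.5, the expected activation during training matches the scaled activation used at test time.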

4. Results

ILSVRC-2010: Achieved a top-1 error rate of 37.5% and a top-5 error rate of 17.0%, significantly outperforming the previous state-of-the-art (47.1% and 28.2%).
ILSVRC-2012: An ensemble of seven CNNs achieved a top-5 error rate of 15.3%, winning the competition. The second-best entry achieved 26.2%.
ImageNet Fall 2009 release (10,184 categories): A variant with an additional sixth convolutional layer achieved 67.4% (top-1) and 40.9% (top-5) error rates, compared to the best previously published results of 78.1% and 60.9%.
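For reference, a top-k error rate counts an image as misclassified when its true label is not among the k classes the model scores highest. A small sketch with made-up scores; the function name and all numbers are illustrative and unrelated to the reported results.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    # Indices of the k largest scores per row (order within the k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Toy scores for 3 images over 6 classes (made-up numbers).
scores = np.array([
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.10, 0.40, 0.05, 0.05],
    [0.25, 0.15, 0.10, 0.10, 0.30, 0.10],
])
labels = np.array([1, 0, 5])
print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

In this toy case only the first image is correct at k=1, while the second also counts as correct at k=2 because its true class has the second-highest score.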

5. Qualitative Evaluation

The two GPUs specialized spontaneously: the first-layer kernels on one GPU were largely color-agnostic, while those on the other were largely color-specific.
The network successfully recognized off-center objects.
Images whose feature vectors from the last 4096-dimensional hidden layer were close in Euclidean distance were semantically similar, even when they differed greatly at the pixel level.
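In the paper, similarity between images is judged by the Euclidean distance between their 4096-dimensional activations in the last hidden layer. The retrieval step can be sketched as follows, using random vectors as a stand-in for real network features; the function name and data are hypothetical.

```python
import numpy as np

def nearest_neighbors(query, features, k=3):
    """Indices of the k rows of `features` closest to `query` (Euclidean)."""
    distances = np.linalg.norm(features - query, axis=1)
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)
# Stand-in for last-hidden-layer activations of 100 images.
features = rng.normal(size=(100, 4096))
query = features[7] + 0.01 * rng.normal(size=4096)  # slightly perturbed copy of row 7
neighbors = nearest_neighbors(query, features)
print(neighbors)
```

With real features the query would be the activation vector of a test image and `features` those of the training images.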

6. Limitations & Further Research

Depth is crucial: removing a single convolutional layer degrades performance. For example, removing any of the middle layers costs about 2% in top-1 performance.
The authors did not use unsupervised pre-training, although they expect it would help, so there is still room for improvement.
The ultimate goal is to use very large and deep CNNs on video sequences.

Full Article

Visit here

HJ
