Convolutional Neural Networks (CNNs)
A convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift-invariant or space-invariant artificial neural networks, based on their shared-weights architecture and translation-invariance characteristics.
Resources
- https://github.com/kjw0612/awesome-deep-vision
- https://en.wikipedia.org/wiki/Convolutional_neural_network
- CNNs chapter in d2l.ai
- Convolutional Neural Networks cheatsheet
- http://cs231n.github.io/convolutional-networks/
- http://cs231n.github.io/understanding-cnn/
- Deep Convolutional Neural Networks as Models of the Visual System: Q&A | Grace W. Lindsay
Convolutions
- Understanding convolutions
- An Introduction to different Types of Convolutions in DL
- A Comprehensive Introduction to Different Types of Convolutions in Deep Learning | by Kunlun Bai | Towards Data Science
- An Introduction to different Types of Convolutions in Deep Learning | by Paul-Louis Pröve | Towards Data Science
- Depthwise separable convolution
- Intuitive understanding of 1D, 2D, and 3D convolutions in convolutional neural networks - Stack Overflow
- https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807
- Convolutions Over Volumes (channels)
1x1 convolutions
- 1x1 convolutions: https://d2l.ai/chapter_convolutional-neural-networks/channels.html#times-1-convolutional-layer
- Networks in Networks and 1x1 Convolutions
- One by One [ 1 x 1 ] Convolution - counter-intuitively useful – Aaditya Prakash (Adi) – Machine Learning
- 1x1 Convolutions: Demystified by Anwesh Marwade | Towards Data Science
- 1X1 Convolution, CNN, CV, Neural Networks | Analytics Vidhya
- A Gentle Introduction to 1x1 Convolutions to Manage Model Complexity - MachineLearningMastery.com
- A convolutional layer with a 1×1 filter can be used at any point in a CNN to control the number of feature maps. It is often referred to as a projection operation or projection layer, or even a feature-map or channel-pooling layer (a minimal sketch follows).
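A minimal sketch, assuming TensorFlow/Keras and an illustrative feature-map shape, of how a 1×1 convolution projects the channel dimension while leaving the spatial layout untouched:

```python
import tensorflow as tf

# Hypothetical feature map: batch of 8, 32x32 spatial grid, 256 channels
x = tf.random.normal((8, 32, 32, 256))

# A 1x1 convolution acts per pixel across channels: it projects the
# 256 input feature maps down to 64 without touching the spatial layout.
proj = tf.keras.layers.Conv2D(filters=64, kernel_size=1, activation="relu")

y = proj(x)
print(y.shape)  # (8, 32, 32, 64) -- same spatial size, fewer channels
```

This is the same projection trick Inception-style networks use to reduce the channel count before expensive large-kernel convolutions.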
Human pose estimation and activity recognition
- https://en.wikipedia.org/wiki/Activity_recognition
- https://machinelearningmastery.com/deep-learning-models-for-human-activity-recognition/
- https://github.com/cbsudux/awesome-human-pose-estimation
- https://www.learnopencv.com/deep-learning-based-human-pose-estimation-using-opencv-cpp-python/
Code
- #CODE Modern Convolutional Neural Network Architectures
- AlexNet, VGG, GoogLeNet, ResNet, ResNeXt, Xception, MobileNet, EfficientNet, RegNet, ConvMixer, ConvNeXt
- #CODE Keras Layers (for TensorFlow 2.x)
- #CODE Model Zoo - Discover open source deep learning code and pretrained models
- #CODE https://github.com/microsoft/computervision-recipes
Channel/visual attention
- #CODE Visual-attention-tf
- Pixel Attention
- Channel Attention (CBAM)
- Efficient Channel Attention
- #CODE Convolution Variants
- Attention Augmented (AA) Convolution Layer
- Mixed Depthwise Convolution Layer
- Drop Block
- Efficient Channel Attention (ECA) Layer
- Convolutional Block Attention Module (CBAM) Layer
References
- #PAPER A guide to convolution arithmetic for deep learning (Dumoulin, 2016)
- #PAPER Xception: Deep Learning with Depthwise Separable Convolutions (Chollet 2017)
- #PAPER Deformable Convolutional Networks (Dai 2017)
- #PAPER Deformable ConvNets v2: More Deformable, Better Results (Zhu 2018)
- #PAPER 3D Depthwise Convolution: Reducing Model Parameters in 3D Vision Tasks (Ye 2018)
- #PAPER Making Convolutional Networks Shift-Invariant Again (Zhang 2019)
- #PAPER A Survey of the Recent Architectures of Deep Convolutional Neural Networks (Khan 2020)
- #PAPER Revisiting Spatial Invariance with Low-Rank Local Connectivity (Elsayed 2020)
- #THESIS/PHD Multi-modal Medical Image Processing with Applications in Hybrid X-ray/Magnetic Resonance Imaging (Stimpel 2021)
- #PAPER Learning to Resize Images for Computer Vision Tasks (Talebi 2021)
- #PAPER Non-deep Networks (Goyal 2021)
- #CODE https://paperswithcode.com/paper/non-deep-networks-1?from=n19
- use parallel subnetworks instead of stacking one layer after another. This helps effectively reduce depth while maintaining high performance
- #PAPER ConvNeXt: A ConvNet for the 2020s (Liu 2022)
- #CODE https://github.com/facebookresearch/ConvNeXt
- #CODE https://github.com/bamps53/convnext-tf/
- #CODE https://github.com/sayakpaul/ConvNeXt-TF
- Paper explained:
- https://twitter.com/papers_daily/status/1481937771732566021
- ConvNeXt essentially takes a ResNet and gradually "modernizes" it to discover which components contribute to performance gains. ConvNeXt applies several tricks such as larger kernels, layer norm, fewer activation functions and separate downsampling layers, to name a few (a rough sketch of such a block follows this reference list).
- These results show that hybrid models are promising and that different components can still be optimized further and composed more effectively to improve the overall model on a wide range of vision tasks.
- #PAPER Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs (Ding 2022)
- #PAPER More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity (Liu 2022)
- #CODE https://github.com/vita-group/slak
- explore the possibility of training extreme convolutions larger than 31×31 and test whether the performance gap can be eliminated by strategically enlarging convolutions
- propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51×51 kernels that can perform on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures like ConvNeXt and RepLKNet, on ImageNet classification as well as typical downstream tasks
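To make the ConvNeXt "modernization" tricks above concrete (large depthwise kernel, LayerNorm instead of BatchNorm, a single GELU inside an inverted bottleneck), here is a rough sketch of a ConvNeXt-style residual block, assuming TensorFlow/Keras; the shapes and `dim` value are illustrative and this is not the official facebookresearch implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def convnext_style_block(x, dim):
    """Sketch of a ConvNeXt-style block: 7x7 depthwise conv -> LayerNorm ->
    pointwise expansion -> single GELU -> pointwise projection -> residual."""
    shortcut = x
    # Large-kernel depthwise convolution for spatial mixing
    x = layers.DepthwiseConv2D(kernel_size=7, padding="same")(x)
    # LayerNorm instead of BatchNorm
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    # Inverted bottleneck: expand channels 4x, one GELU, project back to dim
    x = layers.Dense(4 * dim, activation="gelu")(x)
    x = layers.Dense(dim)(x)
    return layers.Add()([shortcut, x])

inputs = tf.keras.Input((56, 56, 96))            # illustrative feature-map shape
outputs = convnext_style_block(inputs, dim=96)
model = tf.keras.Model(inputs, outputs)
model.summary()
```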
Sequence (time series) modelling
- #PAPER An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (Bai 2018)
- Temporal convolutional networks (TCN)
- #CODE https://github.com/philipperemy/keras-tcn
- Implementing Temporal Convolutional Networks
- The most important component of TCNs is the dilated causal convolution. "Causal" simply means a filter at time step t can only see inputs that are no later than t. The point of using dilated convolutions is to achieve a larger receptive field with fewer parameters and fewer layers.
- A residual block stacks two dilated causal convolution layers together, and the result of the final convolution is added back to the inputs to obtain the output of the block (a minimal sketch of such a block follows this list).
- Temporal convolutional networks for sequence modeling
- #PAPER Convolutions Are All You Need (For Classifying Character Sequences) (Wood-doughty 2018)
- #PAPER InceptionTime: Finding AlexNet for time series classification (Fawaz 2021)
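A minimal sketch of the TCN residual block described above, assuming TensorFlow/Keras; the filter counts, kernel size and dilation rate are illustrative assumptions rather than the reference keras-tcn implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def tcn_residual_block(x, filters, kernel_size=3, dilation_rate=2):
    """TCN-style residual block: two dilated causal convolutions whose result
    is added back to the (possibly projected) input."""
    shortcut = x
    # "Causal" padding: the filter at time step t only sees inputs <= t.
    x = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(x)
    x = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(x)
    # A 1x1 convolution matches the channel count so the residual add is valid.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1)(shortcut)
    return layers.Add()([shortcut, x])

inputs = tf.keras.Input((128, 16))     # (time steps, features), illustrative
outputs = tcn_residual_block(inputs, filters=32)
model = tf.keras.Model(inputs, outputs)
```

Stacking such blocks with exponentially increasing dilation rates (1, 2, 4, ...) is what gives a TCN its large receptive field with few layers.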
Object classification, image recognition
See AI/Computer Vision/Object classification, image recognition
Semantic segmentation
See AI/Computer Vision/Semantic segmentation
Object detection
See AI/Computer Vision/Object detection
Video segmentation and prediction
See AI/Computer Vision/Video segmentation and prediction
Image and video captioning
See AI/Computer Vision/Image and video captioning
Image-to-image translation
See AI/Computer Vision/Image-to-image translation
Super-resolution
See AI/Computer Vision/Super-resolution#Supervised CNN-based
Inpainting
See AI/Computer Vision/Inpainting and restoration#CNN-based
Background subtraction, foreground detection
See AI/Computer Vision/Background subtraction#CNN based
Edge detection
- #PAPER DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection
- #PAPER DeepContour: A Deep Convolutional Feature Learned by Positive-Sharing Loss for Contour Detection
Human pose estimation and activity recognition
- #PAPER Human activity recognition with smartphone sensors using deep learning neural networks (Ann Ronao 2016)
- #PAPER Convolutional pose machines (Wei 2016)
- #PAPER Fast Human Pose Estimation (Zhang 2019)
Motion detection, tracking
Deconvolution
Visual/Channel attention and Saliency
See AI/XAI#Explainability methods for Neural Networks
- #PAPER Squeeze-and-Excitation Networks, SENets (Hu 2017)
- Features can incorporate global context
- Since SENet provides channel attention only through a dedicated global feature descriptor, in this case Global Average Pooling (GAP), information is lost and the attention is point-wise: all spatial positions of a feature map are weighted uniformly, with no discrimination between important or class-deterministic pixels and those belonging to the background or carrying no useful information.
- This justifies coupling channel attention with spatial attention; one of the prime examples is CBAM (published at ECCV 2018). A rough sketch of an SE-style channel-attention block is included at the end of this section.
- #CODE https://github.com/hujie-frank/SENet
- #CODE https://github.com/yoheikikuta/senet-keras
- https://blog.paperspace.com/channel-attention-squeeze-and-excitation-networks/
- https://programmerclick.com/article/4934219785/
- https://pyimagesearch.com/2022/05/30/attending-to-channels-using-keras-and-tensorflow/
- #PAPER CBAM: Convolutional Block Attention Module (Woo 2018)
- #PAPER ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks (Wang 2020)
- this paper proposes an Efficient Channel Attention (ECA) module, which involves only a handful of parameters while bringing a clear performance gain
- proposes a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via a 1D convolution (see the sketch at the end of this section)
- #PAPER See { #srwithpixelattention} in Super-resolution
- #PAPER Attention Mechanisms in Computer Vision: A Survey (Guo 2021)
- #PAPER Visual Attention Network (Guo 2022)
- #CODE https://paperswithcode.com/paper/visual-attention-network?from=n26
- This work presents an approach that decomposes a large-kernel convolution operation to capture long-range relationships. After obtaining the long-range relationships, it estimates the importance of each point and generates an attention map
- #PAPER Attention Map-Guided Visual Explanations for Deep Neural Networks (An 2022)
- attention-map-guided visual explanations for deep neural networks, employing an attention mechanism to find the most important region of an input image
- The Grad-CAM method is used to extract the feature map for deep neural networks, and then the attention mechanism is used to extract the high-level attention maps
- Inspired by the CBAM technique
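To make the two channel-attention mechanisms discussed above concrete, here is a rough TensorFlow/Keras sketch of an SE-style block (GAP, bottleneck MLP, sigmoid rescaling) and an ECA-style block (GAP followed by a 1D convolution across channels, with no dimensionality reduction). The reduction ratio, kernel size and input shape are illustrative assumptions, not the authors' reference code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-Excitation: squeeze spatial dims with GAP, excite with a
    bottleneck MLP, then rescale each channel of the input."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                  # (B, C)
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)     # per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])

def eca_block(x, kernel_size=3):
    """ECA: channel attention via a 1D convolution over the pooled channel
    descriptor -- local cross-channel interaction, no dimensionality reduction."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                  # (B, C)
    s = layers.Reshape((channels, 1))(s)                    # channels as a sequence
    s = layers.Conv1D(1, kernel_size, padding="same",
                      activation="sigmoid")(s)              # (B, C, 1)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])

inputs = tf.keras.Input((32, 32, 64))      # illustrative feature map
outputs = eca_block(se_block(inputs))
model = tf.keras.Model(inputs, outputs)
```

Both blocks return a tensor with the same shape as their input, so they can be dropped after any convolutional stage of an existing CNN.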