EfficientDet: When Object Detection Meets Scalability and Efficiency
Last Updated on July 20, 2023 by Editorial Team
Author(s): Aniket Maurya
Originally published on Towards AI.
EfficientDet is a highly efficient and scalable state-of-the-art object detection model developed by the Google Research Brain Team. It is not a single model but a family of detectors that achieve better accuracy with an order of magnitude fewer parameters and FLOPs than previous object detectors.
The EfficientDet paper describes seven members of this family.
Comparison of EfficientDet detectors [0-6] with other SOTA object detection models.
Quick Overview of the Paper
- EfficientNet is the backbone architecture used in the model. EfficientNet also comes from the same authors at Google. Conventional CNN models scaled the network dimensions (width, depth, and resolution) arbitrarily. EfficientNet instead scales each dimension uniformly with a fixed set of scaling coefficients. It surpassed SOTA accuracy with up to 10x better efficiency.
- BiFPN: When fusing different input features (applying residual or skip connections), most prior work simply summed them up without any distinction. Because the input features are at different resolutions, they do not contribute equally to the fused output feature. The paper proposes a weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features.
- Compound Scaling: For higher accuracy, previous object detection models relied on bigger backbones or larger input image sizes. Compound scaling is a method that uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.
In the authors' words: "Combining EfficientNet backbones with our proposed BiFPN and compound scaling, we have developed a new family of object detectors, named EfficientDet, which consistently achieve better accuracy with an order-of-magnitude fewer parameters and FLOPS than previous object detectors."
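To make compound scaling concrete, here is a minimal sketch of how a single coefficient φ can drive every dimension of the detector at once. The article does not state the scaling constants, so the ones below (base resolution 512 with a step of 128, BiFPN width 64 growing by a factor of 1.35, BiFPN depth 3 + φ, head depth 3 + ⌊φ/3⌋) are recalled from the paper's scaling rules and should be treated as assumptions to verify against the paper. The names `DetectorConfig` and `compound_scale` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorConfig:
    """Hypothetical container for the dimensions that compound scaling controls."""
    phi: int
    backbone: str
    input_resolution: int
    bifpn_width: float
    bifpn_depth: int
    head_depth: int


def compound_scale(phi: int) -> DetectorConfig:
    """Derive all detector dimensions from one compound coefficient phi.

    The constants are assumptions (recalled from the paper, not given in
    this article); check them before relying on the numbers.
    """
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return DetectorConfig(
        phi=phi,
        # Assumption: detector D-phi is paired with backbone EfficientNet-B-phi.
        backbone=f"EfficientNet-B{phi}",
        # Resolution grows linearly with phi.
        input_resolution=512 + 128 * phi,
        # BiFPN channel count grows exponentially with phi (left unrounded here).
        bifpn_width=64 * (1.35 ** phi),
        # Number of repeated BiFPN layers grows linearly with phi.
        bifpn_depth=3 + phi,
        # Class/box head depth grows slowly, one extra layer every three steps.
        head_depth=3 + phi // 3,
    )


if __name__ == "__main__":
    for phi in range(7):
        print(compound_scale(phi))
```

The width comes out as a raw float here; a real configuration would round it to a hardware-friendly channel count, so the printed values should not be expected to match published configurations exactly.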
BiFPN
A conventional FPN (Feature Pyramid Network) is limited by its one-way information flow. PANet added an extra bottom-up path for information flow. PANet achieved better accuracy, but at the cost of more parameters and computation. The paper proposes several optimizations for cross-scale connections:
- Remove nodes that have only one input edge. If a node has only one input edge with no feature fusion, it contributes less to a feature network that aims at fusing different features.
- Add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost.
- Treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion.
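The three optimizations above can be sketched as a small NumPy example. This is a simplified illustration under stated assumptions, not the paper's implementation: feature maps are single-channel 2D arrays, each level is half the size of the previous one, resizing uses nearest-neighbor upsampling and 2x2 max pooling, the convolutions after each fusion are omitted, and `fuse` is a plain average standing in for the learned weighted fusion. All function names are hypothetical.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbor upsampling by a factor of 2 in both dimensions."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def downsample2x(x):
    """2x2 max pooling (height and width must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def fuse(inputs):
    """Placeholder fusion: a plain average instead of learned weights."""
    return sum(inputs) / len(inputs)


def bifpn_layer(feats):
    """One bidirectional layer. feats is ordered from finest to coarsest level."""
    n = len(feats)

    # Top-down path. The coarsest level would have a single input edge,
    # so it gets no intermediate node and its input is passed through.
    td = [None] * n
    td[n - 1] = feats[n - 1]
    for i in range(n - 2, -1, -1):
        td[i] = fuse([feats[i], upsample2x(td[i + 1])])

    # Bottom-up path. Middle levels also receive the extra edge from the
    # original input at the same level (feats[i]).
    out = [None] * n
    out[0] = td[0]
    for i in range(1, n - 1):
        out[i] = fuse([feats[i], td[i], downsample2x(out[i - 1])])
    out[n - 1] = fuse([feats[n - 1], downsample2x(out[n - 2])])
    return out


def bifpn(feats, num_layers):
    """Repeat the same bidirectional layer several times."""
    for _ in range(num_layers):
        feats = bifpn_layer(feats)
    return feats


if __name__ == "__main__":
    pyramid = [np.random.rand(s, s) for s in (32, 16, 8, 4, 2)]
    fused = bifpn(pyramid, num_layers=3)
    print([f.shape for f in fused])
```

Each output level keeps the shape of its input level, which is what allows the same layer to be stacked repeatedly.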
Weighted Feature Fusion
During multi-scale fusion, input features are not simply summed up. The authors propose adding a weight for each input during feature fusion and letting the network learn the importance of each input feature. They consider three weighted fusion approaches:
Unbounded fusion: $O = \sum_i w_i \cdot I_i$
Here $I_i$ is the $i$-th input feature, $O$ is the fused output, and $w_i$ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). Since the scalar weight is unbounded, it could potentially cause training instability. So, softmax-based fusion was tried to normalize the weights.
Softmax-based fusion: $O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$
Softmax normalizes the weights into probabilities in the range 0 to 1, which denote the importance of each input. However, the softmax leads to a slowdown on GPU.
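Fast normalized fusion: $O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$
Each $w_i$ is kept non-negative by applying a ReLU to it, and $\epsilon$ is a small value added for numerical stability. It is up to 30% faster on GPU and gives almost as accurate results as softmax-based fusion.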
The final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion.
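The three fusion variants can be compared side by side in a short NumPy sketch. It assumes scalar (per-feature) weights and input features that have already been resized to the same shape. The function names and the default `eps=1e-4` are illustrative choices, not taken from the article.

```python
import numpy as np


def unbounded_fusion(features, weights):
    """Weighted sum with raw, unnormalized weights."""
    return sum(w * f for w, f in zip(weights, features))


def softmax_fusion(features, weights):
    """Weights are normalized with a softmax so they sum to 1."""
    w = np.asarray(weights, dtype=np.float64)
    e = np.exp(w - w.max())  # subtract the max for numerical stability
    norm = e / e.sum()
    return sum(n * f for n, f in zip(norm, features))


def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weights are clipped with ReLU, then divided by their sum plus eps."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU
    norm = w / (eps + w.sum())
    return sum(n * f for n, f in zip(norm, features))


if __name__ == "__main__":
    feats = [np.ones((2, 2)), 3.0 * np.ones((2, 2))]
    weights = [0.5, 1.5]
    print(unbounded_fusion(feats, weights))
    print(softmax_fusion(feats, weights))
    print(fast_normalized_fusion(feats, weights))
```

The fast variant avoids the exponentials of softmax while still keeping each normalized weight between 0 and 1, which is the trade-off the section describes.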
EfficientDet Architecture
EfficientDet follows the one-stage detection paradigm. A pre-trained EfficientNet backbone is used together with BiFPN as the feature extractor. BiFPN takes the {P3, P4, P5, P6, P7} features from the EfficientNet backbone network and repeatedly applies bidirectional feature fusion.
The fused features are fed to a class and bounding box network for predicting object class and bounding box.
EfficientDet: Scalable and Efficient Object Detection
Model efficiency has become increasingly important in computer vision. In this paper, we systematically study various…
arxiv.org
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for…
arxiv.org
Path Aggregation Network for Instance Segmentation
The way that information propagates in neural networks is of great importance. In this paper, we propose Path…
arxiv.org
Deep Residual Learning for Image Recognition
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of…
arxiv.org
Hope you liked the article.
👉 Twitter: https://twitter.com/aniketmaurya
👉 Mail: [email protected]