Explained: Reverse Attention Network (RAN) in Image Segmentation
Author(s): Leo Wang
Table of Contents
· Problems
· A Solution
· Reverse Attention Network (RAN)
  ∘ Reverse Branch (RB)
  ∘ Reverse Attention Branch (RAB)
  ∘ Combine the result
· Training
· Performance
· Citation
Problems
- Most CNN-based semantic segmentation methods focus on simply getting the predictions right, without any mechanism that teaches the model to discern the differences between classes, so the characteristics of less common classes might be ignored.
- High-level features are shared across different classes due to the visual similarity among them, which may yield confusing results in regions containing the boundaries of different classes (e.g., the background and an object with similar activation strengths) or where classes are mixed together.
For a better understanding of the problem, see Fig. 1. As the attention heatmap shows, most modern encoder-decoder models produce strong neural activations in parts where two objects are "mixed" together (i.e., regions with obscure boundaries, or where two or more objects share similar spatial patterns), even though the model should not pay much attention to those "mixed" parts during prediction at all.
A Solution
- The authors devised a mechanism to identify those mixed, special regions and amplify the weaker activations to capture the target object, so the network learns not only to discern the background class but also to discern the different objects present in the image.
Therefore, they proposed a novel architecture dubbed "Reverse Attention Network" (RAN) to address the aforementioned problems.
In RAN, there are two different branches (one circled in red and one circled in blue) designed to learn background features and the object's features, respectively.
To further highlight the knowledge learned from the object class, a reverse attention structure is designed to generate per-class masks that amplify the object class's activations in the confused region.
Lastly, the predictions are fused together to yield the final prediction.
Reverse Attention Network (RAN)
To have a more detailed understanding of the proposed model, please see Fig. 3.
After the input image is given, the process breaks down into the following steps:
- A feature map is generated using a selected model architecture (usually ResNet-101 or VGG16, but it can vary) to learn object features, as sketched right after this list.
- Then, the map is split into two branches.
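As a rough illustration of this first step, here is a minimal sketch of a shared feature extractor built from a torchvision ResNet-101; the truncation point, input size, and use of torchvision here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision

# A minimal sketch: truncate ResNet-101 before its pooling/FC head so it
# emits a spatial feature map that both branches can share.
backbone = torchvision.models.resnet101(weights=None)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 224, 224)   # dummy input image
features = feature_extractor(image)   # shared feature map: (1, 2048, 7, 7)
print(features.shape)
```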
Reverse Branch (RB)
- Colored in yellow, the model first trains a CONV_rev layer to learn the "reverse object class" explicitly (the reverse object class is the reversed ground truth for the object class).
- In order to get the reverse object class, the background and other classes are set to 1, while the object class is set to 0.
- When it is a multi-class segmentation problem, however, a common alternative is to reverse the sign of all class-wise activations (the NEG block) before feeding them to the softmax-based classifier. This approach allows the CONV_rev layer to be trained with the same class-wise ground-truth labels (see the sketch below).
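Below is a minimal sketch of both variants described above; CONV_rev is modeled as a hypothetical 1×1 convolution, and the shapes and class count are placeholder assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, channels = 21, 2048                # placeholder sizes
conv_rev = nn.Conv2d(channels, num_classes, 1)  # hypothetical CONV_rev layer

features = torch.randn(1, channels, 7, 7)  # shared feature map from the backbone
rev_logits = conv_rev(features)            # reverse-class activations

# Single-object case: the reverse ground truth simply inverts the object
# mask (object pixels -> 0, background and other classes -> 1).
object_mask = torch.randint(0, 2, (1, 7, 7)).float()  # dummy object-class GT
reverse_gt = 1.0 - object_mask

# Multi-class alternative: negate all class-wise activations (the NEG
# block) so the ordinary class-wise labels can supervise CONV_rev directly.
labels = torch.randint(0, num_classes, (1, 7, 7))  # dummy per-pixel labels
loss = F.cross_entropy(-rev_logits, labels)        # softmax over negated logits
```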
Reverse Attention Branch (RAB)
- Directly subtracting the Reverse Branch's activations element-wise from the original prediction was found to perform worse, so the Reverse Attention Branch is proposed instead to highlight the regions overlooked by the original prediction (including mixed and background areas). The output of reverse attention is a class-oriented mask that amplifies the reverse activation map.
- As shown in Fig. 3 and Fig. 5, the initial feature map from the input image is fed into the CONV_org layer.
- Then, the resulting feature map's pixel values are sign-flipped (negated) by the NEG block.
- Then, the sigmoid function is applied to map the pixel values into [0, 1], producing the attention mask.
- The aforementioned steps can be summarized as Formula 1, where i, j indicate the pixel location: A_{i,j} = Sigmoid(−F_{i,j}), with F_{i,j} denoting the CONV_org response at pixel (i, j).
- Therefore, regions with small or negative responses are highlighted by the NEG and sigmoid operations, while areas of positive activations (i.e., confident scores) are suppressed in the reverse attention branch, as the sketch below illustrates.
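Concretely, the three steps above collapse into a sigmoid over the negated CONV_org response. A minimal sketch, again treating CONV_org as a hypothetical 1×1 convolution with placeholder shapes:

```python
import torch
import torch.nn as nn

conv_org = nn.Conv2d(2048, 21, kernel_size=1)  # hypothetical CONV_org layer

def reverse_attention(features):
    org_logits = conv_org(features)  # original class-wise activations
    negated = -org_logits            # NEG block: flip the sign
    return torch.sigmoid(negated)    # squash into [0, 1] -> attention mask

features = torch.randn(1, 2048, 7, 7)    # shared feature map
attention = reverse_attention(features)  # small/negative responses -> near 1
```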
Combine the result
- Then, the map from the Reverse Attention Branch is element-wise multiplied with the output of the Reverse Branch, and the resulting map is subtracted from the original prediction to generate the final prediction, as sketched below.
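Continuing the sketches above, the fusion step might look like the following; all three inputs are placeholder tensors standing in for the branch outputs:

```python
import torch

# Placeholder branch outputs, shape (N, num_classes, H, W).
org_pred = torch.randn(1, 21, 7, 7)   # original prediction (CONV_org path)
rev_pred = torch.randn(1, 21, 7, 7)   # Reverse Branch activations (CONV_rev path)
attention = torch.sigmoid(-org_pred)  # Reverse Attention Branch mask

# Gate the reverse activations with the attention mask, then subtract
# the result from the original prediction to obtain the fused output.
final_pred = org_pred - attention * rev_pred
```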
Training
This is beyond the scope of this article; please refer to the original paper [1] for the training details.
Performance
The benchmark tables are omitted here; in short, RAN improves over its DeepLab baselines on standard semantic segmentation datasets (see [1] for the full results).
Thank you! ❤️
Citation
[1] Qin Huang, Chunyang Xia, Chihao Wu, Siyang Li, Ye Wang, Yuhang Song, and C.-C. Jay Kuo, "Semantic Segmentation with Reverse Attention," BMVC 2017.