As one of the three top conferences in computer vision, CVPR has always attracted wide attention, and the papers it accepts represent the newest directions and the state of the art in the field. This year, CVPR 2019 will be held in Long Beach, California. After the acceptance results were announced last month, the CV community saw another small surge of excitement, with interpretive articles on CVPR papers appearing in quick succession.
According to the paper list on the CVPR website, 1,300 papers were accepted this year, compared with 643 (2016), 783 (2017), and 979 (2018) over the past three years. This growth is one indication that computer vision is flourishing: as the foundation of how machines perceive the world, and as one of the principal artificial intelligence technologies, it is drawing ever more attention.
Researchers around the world are currently immersed in the flood of CVPR 2019 papers, eager to get a first look at the most cutting-edge results. In this article, however, we set CVPR 2019 aside and look back at the papers of CVPR 2018.
Using data from Google Scholar, we counted the five most-cited papers among the 979 accepted at CVPR 2018, in the hope that citation counts reveal which of these papers have drawn the most attention from researchers worldwide.
Searching Google Scholar against the CVPR 2018 paper list (http://openaccess.thecvf.com/CVPR2018.py) yields the data below (as retrieved on March 19, 2019; because the counts for the 2nd and 3rd places are very close, we do not rank them explicitly):
The most-cited CVPR 2018 papers have earned substantial attention and praise from the research community, largely because of their originality. For example, the top-ranked Squeeze-and-Excitation Networks (SE-Net) has a very simple construction: it is easy to deploy, requires no new functions or layers, and has favorable properties in terms of model size and computational complexity.
With SE-Net, the authors reduced the Top-5 error on ImageNet to 2.251% (the previous best was 2.991%), winning the image classification task of the ImageNet 2017 competition. Over the past year, SE-Net has not only been widely adopted in industry as a high-performing deep learning building block, but has also served as a reference for other researchers' work.
In addition, Learning Transferable Architectures for Scalable Image Recognition from Google Brain, which proposes using one neural network to learn the structure of another, has also drawn the attention of many researchers.
The abstracts of the five papers follow, for readers' reference:
Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, several recent approaches have shown the benefit of enhancing spatial encoding.
In this work, we focus on the channel relationship and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets.
Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at a minimal additional computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016. Code and models are available at https://github.com/hujie-frank/SENet.
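As a minimal sketch of the squeeze-and-excitation recalibration described above (written in plain NumPy rather than a deep learning framework; the shapes and reduction ratio are illustrative, not the paper's settings):

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation recalibration on a feature map x of shape (C, H, W).

    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights.
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    z = x.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: two small fully connected layers, ReLU then sigmoid.
    s = np.maximum(w1 @ z, 0.0)                  # shape (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))          # shape (C,), per-channel gates in (0, 1)
    # Recalibrate: rescale each channel of x by its learned gate.
    return x * s[:, None, None]

# Toy usage with C=8 channels and reduction ratio r=4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
y = se_block(x, w1, w2)
assert y.shape == x.shape
```

Because the gates lie in (0, 1) and multiply the input elementwise, the block only reweights existing channels; this is why it adds so little computation to a host architecture.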
We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ∼13× actual speedup over AlexNet while maintaining comparable accuracy.
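The channel shuffle operation at the heart of ShuffleNet reduces to a reshape, transpose, reshape over the channel axis; a minimal NumPy sketch (the group count and tensor shape are chosen for illustration):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups: (C, H, W) -> (C, H, W).

    After a grouped convolution, channels within a group never mix; shuffling
    lets information flow between groups in the next grouped convolution.
    """
    c, h, w = x.shape
    assert c % groups == 0
    # Split channels into (groups, c_per_group), swap the two axes, flatten back.
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)

# With 6 channels in 3 groups, channel order 0,1,2,3,4,5 becomes 0,2,4,1,3,5.
x = np.arange(6, dtype=float).reshape(6, 1, 1)
shuffled = channel_shuffle(x, groups=3)
print(shuffled.ravel())  # [0. 2. 4. 1. 3. 5.]
```

The operation is a pure permutation, so it costs no multiply-adds at all, which is what lets ShuffleNet use cheap grouped pointwise convolutions without isolating the groups from each other.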
Developing neural network image classification models often requires significant architecture engineering. In this paper, we study a method to learn the model architectures directly on the dataset of interest. As this approach is expensive when the dataset is large, we propose to search for an architectural building block on a small dataset and then transfer the block to a larger dataset.
The key contribution of this work is the design of a new search space (which we call the “NASNet search space”) which enables transferability. In our experiments, we search for the best convolutional layer (or “cell”) on the CIFAR-10 dataset and then apply this cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters to design a convolutional architecture, which we name a “NASNet architecture”.
We also introduce a new regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models. On CIFAR-10 itself, a NASNet found by our method achieves 2.4% error rate, which is state-of-the-art. Although the cell is not searched for directly on ImageNet, a NASNet constructed from the best cell achieves, among the published works, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5 on ImageNet. Our model is 1.2% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS – a reduction of 28% in computational demand from the previous state-of-the-art model.
When evaluated at different levels of computational cost, accuracies of NASNets exceed those of the state-of-the-art human-designed models. For instance, a small version of NASNet also achieves 74% top-1 accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms. Finally, the image features learned from image classification are generically useful and can be transferred to other computer vision problems. On the task of object detection, the learned features by NASNet used with the Faster-RCNN framework surpass state-of-the-art by 4.0% achieving 43.1% mAP on the COCO dataset.
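The transfer recipe in the abstract (search for one good cell on a small dataset, then build larger networks by stacking copies of it, each with its own parameters) can be sketched generically; the "cell" below is a stand-in linear layer, not the actual searched NASNet cell:

```python
import numpy as np

def make_cell(in_dim, out_dim, rng):
    """A stand-in 'cell': one linear map plus ReLU, with its own parameters."""
    w = rng.standard_normal((out_dim, in_dim)) * 0.1
    return lambda h: np.maximum(w @ h, 0.0)

def stack_cells(dims, rng):
    """Build a network by stacking copies of the cell, each freshly parameterized,
    mirroring how one searched cell is repeated to form a larger architecture."""
    cells = [make_cell(a, b, rng) for a, b in zip(dims, dims[1:])]
    def net(h):
        for cell in cells:
            h = cell(h)
        return h
    return net

rng = np.random.default_rng(0)
# The same cell design reused at three depths/widths for a larger dataset.
net = stack_cells([16, 32, 32, 10], rng)
out = net(rng.standard_normal(16))
assert out.shape == (10,)
```

The point of the sketch is only the structure: the expensive search chooses the cell's internal wiring once, while scaling to a bigger dataset is just a matter of how many copies are stacked and how wide each copy is.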
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.
The MobileNetV2 architecture is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design.
Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, and VOC image segmentation. We evaluate the trade-offs between accuracy and the number of operations measured by multiply-adds (MAdd), as well as actual latency and the number of parameters.
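A minimal NumPy sketch of the inverted residual block's channel flow: pointwise convolutions become per-pixel matrix multiplies, a per-channel scale stands in for the 3x3 depthwise convolution, ReLU6 appears only in the wide expansion, and the projection back to the narrow bottleneck stays linear. All dimensions are illustrative:

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def inverted_residual(x, w_expand, w_depth, w_project):
    """Inverted residual on x of shape (C, N), where N = H*W flattened pixels.

    w_expand: (t*C, C) 1x1 expansion; w_depth: (t*C,) per-channel scale standing
    in for the depthwise convolution; w_project: (C, t*C) linear projection.
    """
    h = relu6(w_expand @ x)          # expand to a wide representation, non-linear
    h = relu6(w_depth[:, None] * h)  # depthwise filtering, applied per channel
    h = w_project @ h                # project back: linear, no ReLU in the narrow layer
    return x + h                     # shortcut connects the thin bottlenecks

rng = np.random.default_rng(0)
c, t, n = 4, 6, 9                    # bottleneck width 4, expansion factor 6, 3x3 pixels
x = rng.standard_normal((c, n))
y = inverted_residual(
    x,
    rng.standard_normal((t * c, c)) * 0.1,
    rng.standard_normal(t * c) * 0.1,
    rng.standard_normal((c, t * c)) * 0.1,
)
assert y.shape == x.shape
```

Note how the residual addition happens in the narrow, linear domain: that is the "inverted" part relative to classic residual blocks, where the shortcut connects the wide layers.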
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered.
Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
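The top-down weighting over bottom-up region features reduces to a softmax-weighted sum; a minimal sketch with hypothetical dimensions and a simplified linear scoring function (in the paper, the region features come from Faster R-CNN and the query from the captioning or VQA model's state):

```python
import numpy as np

def attend(regions, query, w):
    """Top-down attention over bottom-up region features.

    regions: (K, D) one feature vector per detected region,
    query:   (Q,)   top-down context, e.g. the caption decoder's state,
    w:       (D + Q,) scoring weights applied to each (region, query) pair.
    Returns the attended feature: a convex combination of the region vectors.
    """
    k = regions.shape[0]
    # Score each region against the query, then normalize with a softmax.
    pairs = np.concatenate([regions, np.tile(query, (k, 1))], axis=1)  # (K, D+Q)
    scores = pairs @ w                                                 # (K,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ regions                                             # (D,)

rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 8))   # 5 salient regions, 8-dim features each
query = rng.standard_normal(3)
feat = attend(regions, query, rng.standard_normal(11))
assert feat.shape == (8,)
```

The key design choice the abstract highlights is what the K candidates are: fixed grid positions in earlier work, versus object proposals and other salient regions here, so the attention weights land on semantically meaningful units.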