EdgeL3: Compressing L3-Net for Mote-Scale Urban Noise Monitoring

Cited by: 15
Authors
Kumari, Sangeeta [1 ]
Roy, Dhrubojyoti [1 ]
Cartwright, Mark [2 ,3 ]
Bello, Juan Pablo [2 ,3 ]
Arora, Anish [1 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] NYU, Mus & Audio Res Lab MARL, New York, NY 10003 USA
[3] NYU, Ctr Urban Sci & Progress CUSP, New York, NY 10003 USA
Source
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) | 2019
Keywords
Convolutional neural networks
DOI
10.1109/IPDPSW.2019.00145
CLC classification number
TP3 [computing technology and computer technology]
Subject classification code
0812
Abstract
Urban noise sensing on deeply embedded devices at the edge of the Internet of Things (IoT) is challenging not only because of the lack of sufficient labeled training data but also because device resources are quite limited. Look, Listen, and Learn (L3), a recently proposed state-of-the-art transfer learning technique, mitigates the first challenge by training self-supervised deep audio embeddings through binary Audio-Visual Correspondence (AVC); the resulting embeddings can then be used to train a variety of downstream audio classification tasks. However, with close to 4.7 million parameters, the multi-layer L3-Net CNN is still prohibitively expensive to run on small edge devices such as "motes," which use a single microcontroller and limited memory to achieve long-lived, self-powered operation. In this paper, we comprehensively explore the feasibility of compressing L3-Net for mote-scale inference. We use pruning, ablation, and knowledge distillation to show that the originally proposed L3-Net architecture is substantially overparameterized, not only for AVC but also for the target task of sound classification, as evaluated on two popular downstream datasets. Our findings demonstrate the value of fine-tuning and knowledge distillation in regaining the performance lost through aggressive compression. Finally, we present EdgeL3, the first L3-Net reference model compressed by 1-2 orders of magnitude for real-time urban noise monitoring on resource-constrained edge devices, which fits in just 0.4 MB of memory using a half-precision floating-point representation.
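As a rough sanity check of the stated memory budget, the sketch below works through the sizing arithmetic in Python; it assumes only the ~4.7 million original parameters and the 0.4 MB target quoted in the abstract, and does not reproduce the actual EdgeL3 architecture or its layer sizes.

# Back-of-the-envelope sizing for the compression claim above; the only inputs
# taken from the abstract are the ~4.7 M original parameters and the 0.4 MB target.
ORIG_PARAMS = 4.7e6          # approximate parameter count of the original L3-Net audio model
BYTES_FP32, BYTES_FP16 = 4, 2
BUDGET_MB = 0.4              # EdgeL3 target memory footprint

orig_fp32_mb = ORIG_PARAMS * BYTES_FP32 / 2**20      # ~17.9 MB at full precision
params_in_budget = BUDGET_MB * 2**20 / BYTES_FP16    # ~210,000 parameters at fp16

print(f"Original L3-Net at fp32: {orig_fp32_mb:.1f} MB")
print(f"Parameters that fit in {BUDGET_MB} MB at fp16: {params_in_budget:,.0f}")
print(f"Implied reduction: ~{ORIG_PARAMS / params_in_budget:.0f}x")

The implied reduction of roughly 20-25x is consistent with the "1-2 orders of magnitude" compression stated in the abstract.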
Pages: 877-884
Page count: 8