Learning a deeply supervised multi-modal RGB-D embedding for semantic scene and object category recognition

Recognizing the semantic category of objects and scenes captured by vision-based sensors is a challenging yet essential capability for mobile robots and UAVs performing high-level tasks such as long-term autonomous navigation. However, extracting discriminative features from multi-modal inputs such as RGB-D images in a unified manner is non-trivial given the heterogeneous nature of the modalities. We propose a deep network that constructs a joint, shared multi-modal representation by bilinearly combining the convolutional neural network (CNN) streams of the RGB and depth channels. Taking the outer product of the two feature extractor outputs encourages bilateral transfer learning between the modalities. Furthermore, we devise a technique for multi-scale feature abstraction using deeply supervised branches connected to all convolutional layers of the multi-stream CNN. We show that end-to-end learning of the network is feasible even with a limited amount of training data, and that the trained network generalizes across different datasets and applications. Experimental evaluations on benchmark RGB-D object and scene categorization datasets show that the proposed technique consistently outperforms state-of-the-art algorithms.
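
The bilinear combination described in the abstract can be pictured with a short sketch. Below is a minimal PyTorch illustration, not the authors' implementation: the backbone layers, feature dimension, and class count are assumptions made purely for the example. Two CNN streams each yield a feature vector, and their outer product forms the joint RGB-D embedding passed to a classifier.

```python
import torch
import torch.nn as nn

class BilinearRGBDFusion(nn.Module):
    """Toy two-stream network: the outer product of the RGB and depth
    feature vectors forms a joint multi-modal embedding. A sketch of
    bilinear fusion only; all dimensions and layers are illustrative."""

    def __init__(self, feat_dim=64, num_classes=51):
        super().__init__()
        # One small convolutional stream per modality (stand-ins for
        # the full CNN streams described in the abstract).
        def make_stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.rgb_stream = make_stream(3)    # 3-channel RGB input
        self.depth_stream = make_stream(1)  # 1-channel depth input
        self.classifier = nn.Linear(feat_dim * feat_dim, num_classes)

    def forward(self, rgb, depth):
        f_rgb = self.rgb_stream(rgb)    # (B, feat_dim)
        f_d = self.depth_stream(depth)  # (B, feat_dim)
        # Per-sample outer product -> (B, feat_dim, feat_dim), then flatten.
        joint = torch.einsum('bi,bj->bij', f_rgb, f_d).flatten(1)
        # Signed sqrt + L2 normalization, a common convention for
        # bilinear features (an assumption, not confirmed by the abstract).
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)
        joint = nn.functional.normalize(joint)
        return self.classifier(joint)

model = BilinearRGBDFusion()
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 51])
```

Because the classifier sees every pairwise product of RGB and depth activations, gradients from one stream flow through the other's features, which is what makes the outer product a natural vehicle for the bilateral transfer learning the abstract mentions.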

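The deeply supervised, multi-scale part can be sketched in the same spirit. In this hedged single-stream illustration (branch placement, layer widths, and equal loss weights are all assumptions, not the paper's design), an auxiliary classifier branch is attached to each convolutional block, and the branch losses are summed with the final loss so every depth of the network receives a direct supervisory signal.

```python
import torch
import torch.nn as nn

class DeeplySupervisedCNN(nn.Module):
    """Single-stream sketch of deep supervision: each conv block feeds an
    auxiliary classifier head, so intermediate layers get their own loss."""

    def __init__(self, num_classes=51):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                          nn.ReLU())
            for c_in, c_out in [(3, 16), (16, 32), (32, 64)]
        ])
        # One auxiliary head per block: pool to a vector, then classify.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(c, num_classes))
            for c in (16, 32, 64)
        ])

    def forward(self, x):
        logits = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits.append(head(x))  # one prediction per feature scale
        return logits  # the last entry plays the role of the main output

# Training step: sum the branch losses with the final loss.
model = DeeplySupervisedCNN()
criterion = nn.CrossEntropyLoss()
images, labels = torch.randn(4, 3, 64, 64), torch.randint(0, 51, (4,))
outputs = model(images)
# Equal weights here; the relative weighting is a free design choice.
loss = sum(criterion(out, labels) for out in outputs)
loss.backward()
```
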
Bibliographic Details
Main Authors: Mohd Zaki, Hasan Firdaus; Shafait, Faisal; Mian, Ajmal
Format: Article
Language: English
Published: Elsevier 2017
Subjects: QA75 Electronic computers. Computer science
Online Access: http://irep.iium.edu.my/61281/
http://irep.iium.edu.my/61281/1/Learning%20a%20deeply%20supervised%20multi-modal%20RGB-D%20embedding%20for%20semantic%20scene%20and%20object%20category%20recognition.pdf
http://irep.iium.edu.my/61281/7/61281-Learning%20a%20deeply%20supervised%20multi-modal-SCOPUS.pdf
Citation: Mohd Zaki, Hasan Firdaus, Shafait, Faisal and Mian, Ajmal (2017). Learning a deeply supervised multi-modal RGB-D embedding for semantic scene and object category recognition. Robotics and Autonomous Systems, 92, pp. 41-52. ISSN 0921-8890. DOI: 10.1016/j.robot.2017.02.008
Publisher URL: https://www.sciencedirect.com/science/article/pii/S0921889016304225
Repository: IIUM Repository, International Islamic University Malaysia