From Monocular to Learned vSLAM

Authors

Abstract

Size, Weight and Power (SWaP) constraints in robotics lead vSLAM strategies to prefer monocular cameras for their high information-to-weight ratio and miniature size. Conventional monoSLAM methods compete with stereo and RGB-D SLAM in localization accuracy; however, their 3D reconstruction of the environment is limited to sparse point clouds. Degraded performance under pure camera rotation, inherent scale ambiguity, and difficult map initialization are a few of the many impediments to monoSLAM, all of which arise because the depth of the scene is lost when it is captured by a single camera. Mitigating these issues has traditionally demanded increasingly complex algorithms. In this regard, deep learning architectures have given vSLAM a new tool: networks that predict depth maps from learned monocular cues, or even regress the full sensor state by learning optical flow. The amalgamation of these CNNs with conventional vSLAM strategies has given birth to a new class of vision systems: Learned vSLAM. Motivated by the success of these intelligent vSLAM architectures and their potential role in realizing truly miniature robots, we provide a comprehensive review of Learned vSLAM strategies, their advantages over conventional monoSLAM, and their remaining limitations.
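
To make the scale-ambiguity remedy concrete, the following minimal Python/NumPy sketch shows one common way a learned depth map can recover the metric scale that a monocular camera loses: the up-to-scale depths of triangulated map points are aligned to CNN-predicted depths through a robust median ratio, in the spirit of systems such as CNN-SVO. The sketch is illustrative only; predict_depth and the surrounding names are hypothetical placeholders for an actual depth network and SLAM pipeline.

    import numpy as np

    def estimate_scale(slam_depths, cnn_depths):
        """Estimate the global metric scale of a monocular SLAM map.

        slam_depths: up-to-scale depths of tracked map points (triangulation)
        cnn_depths:  CNN-predicted depths sampled at the same pixel locations
        """
        slam_depths = np.asarray(slam_depths, dtype=float)
        cnn_depths = np.asarray(cnn_depths, dtype=float)
        valid = (slam_depths > 0) & (cnn_depths > 0)
        # Median of per-point ratios: robust to outliers in either the
        # sparse map or the learned depth map.
        return np.median(cnn_depths[valid] / slam_depths[valid])

    # Hypothetical usage inside a tracking loop:
    # depth_map = predict_depth(frame)               # learned monocular cue
    # s = estimate_scale(map_point_depths, depth_map[us, vs])
    # map_points *= s                                # rescale map to metric units

A median ratio is preferred over a mean so that gross errors in either the sparse map or the learned depth map do not skew the estimated scale.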

Published

2022-03-24