DM-VLP-Grasp: Diffusion Model-Based Grasp Planning with Visual-Language Pretraining for Unknown Object Manipulation
Abstract
This paper proposes DM-VLP-Grasp, an algorithm for grasping unknown objects that combines a diffusion model with visual-language pre-training (VLP), aiming to improve robotic grasping performance in complex environments. An improved VLP model fuses image and text information to extract accurate grasp-relevant object features, and a diffusion model generates reliable grasp strategies that are refined through iterative denoising to achieve efficient grasping. On a self-built dataset of 8,000 samples, DM-VLP-Grasp achieves a grasp success rate of 93.6% with a single-strategy generation time of 0.78 seconds, demonstrating high stability and computational efficiency. Grasp stability is measured by the root mean square (RMS) of the object's shaking amplitude and by the fluctuation range of the grasping force, and the algorithm performs well on both measures. The experimental results confirm the effectiveness and novelty of the algorithm for the unknown-object grasping task and offer a new solution for automated robotic grasping.
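To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of the two components the abstract names: a DDPM-style reverse-diffusion loop that iteratively denoises a random grasp pose conditioned on a fused vision-language embedding, and the RMS shaking-amplitude metric used to score grasp stability. The network architecture, pose parameterization, step count, and noise schedule are all illustrative assumptions.

```python
# Minimal sketch, assuming a DDPM-style sampler: a 7-D grasp pose
# (position + quaternion) is denoised step by step, conditioned on a
# fused vision-language (VLP) feature vector. Dimensions, schedule, and
# module names are illustrative, not taken from the paper.
import torch
import torch.nn as nn

T = 50                       # number of diffusion steps (assumed)
POSE_DIM, COND_DIM = 7, 512  # grasp pose size, fused VLP feature size (assumed)

betas = torch.linspace(1e-4, 0.02, T)    # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class NoisePredictor(nn.Module):
    """Predicts the noise added to a grasp pose, given the noisy pose,
    the timestep, and the vision-language conditioning vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(POSE_DIM + 1 + COND_DIM, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, POSE_DIM),
        )

    def forward(self, pose_t, t, cond):
        t_feat = t.float().unsqueeze(-1) / T  # crude timestep encoding
        return self.net(torch.cat([pose_t, t_feat, cond], dim=-1))

@torch.no_grad()
def sample_grasp(model, cond):
    """Iteratively denoise pure noise into a grasp pose proposal."""
    pose = torch.randn(cond.shape[0], POSE_DIM)
    for t in reversed(range(T)):
        eps = model(pose, torch.full((cond.shape[0],), t), cond)
        a, ab = alphas[t], alpha_bars[t]
        mean = (pose - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        noise = torch.randn_like(pose) if t > 0 else torch.zeros_like(pose)
        pose = mean + torch.sqrt(betas[t]) * noise
    return pose

def shake_rms(displacements):
    """RMS of the grasped object's shaking amplitude, the stability
    metric named in the abstract; `displacements` is a 1-D tensor of
    displacement samples recorded during a grasp trial."""
    return torch.sqrt(torch.mean(displacements ** 2))

model = NoisePredictor()                 # untrained, for illustration only
vlp_feature = torch.randn(1, COND_DIM)   # stand-in for the fused VLP embedding
print(sample_grasp(model, vlp_feature))
```

In this reading, "iterative optimization" corresponds to the reverse-diffusion loop in `sample_grasp`: each step removes a little of the injected noise, so the pose distribution is gradually sharpened toward grasps consistent with the conditioning features.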
Wang, S., Zhou, Z., & Kan, Z. (2022). When transformer meets robotic grasping: Exploits context for efficient grasp detection. IEEE robotics and automation letters, 7(3), 8170-8177.
Liu, Q. C., Zhang, X. Y., Fan, R., Liu, W. M., & Xue, J. F. (2024). A Method for Industrial Robots to Grasp and Detect Instrument Parts under 3D Visual Guidance. Journal of Computers, 35(1), 167-175.
Huang, B., Han, S. D., Yu, J., & Boularias, A. (2021). Visual foresight trees for object retrieval from clutter with nonprehensile rearrangement. IEEE Robotics and Automation Letters, 7(1), 231-238.
Knights, E., Mansfield, C., Tonin, D., Saada, J., Smith, F. W., & Rossit, S. (2021). Hand-selective visual regions represent how to grasp 3D tools: Brain decoding during real actions. Journal of Neuroscience, 41(24), 5263-5273.
Wandelt, S. K., Kellis, S., Bjånes, D. A., Pejsa, K., Lee, B., Liu, C., & Andersen, R. A. (2022). Decoding grasp and speech signals from the cortical grasp circuit in a tetraplegic human. Neuron, 110(11), 1777-1787.
Chen, Y., Wu, Y., Zhang, Z., Miao, Z., Zhong, H., Zhang, H., & Wang, Y. (2022). Image-based visual servoing of unmanned aerial manipulators for tracking and grasping a moving target. IEEE Transactions on Industrial Informatics, 19(8), 8889-8899.
Zhang, S., Chen, Y., Zhang, L., Gao, X., & Chen, X. (2022). Study on robot grasping system of SSVEP-BCI based on augmented reality stimulus. Tsinghua Science and Technology, 28(2), 322-329.
Harrak, M. H., Heurley, L. P., Morgado, N., Mennella, R., & Dru, V. (2022). The visual size of graspable objects is needed to potentiate grasping behaviors even with verbal stimuli. Psychological Research, 86(7), 2067-2082.
Song, K., Wang, J., Bao, Y., Huang, L., & Yan, Y. (2022). A novel visible-depth-thermal image dataset of salient object detection for robotic visual perception. IEEE/ASME Transactions on Mechatronics, 28(3), 1558-1569.
Gong, Z., Qiu, C., Tao, B., Bai, H., Yin, Z., & Ding, H. (2021). Tracking and grasping of a moving target based on an accelerated geometric particle filter on a colored image. Science China Technological Sciences, 64(4), 755-766.
Xu, R., Chu, F. J., & Vela, P. A. (2022). GKNet: Grasp keypoint network for candidate detection. The International Journal of Robotics Research, 41(4), 361-389.
De Farias, C., Marturi, N., Stolkin, R., & Bekiroglu, Y. (2021). Simultaneous tactile exploration and grasp refinement for unknown objects. IEEE Robotics and Automation Letters, 6(2), 3349-3356.
Marwan, Q. M., Chua, S. C., & Kwek, L. C. (2021). Comprehensive review on the reaching and grasping of objects in robotics. Robotica, 39(10), 1849-1882.
Scheikl, P. M., Tagliabue, E., Gyenes, B., Wagner, M., Dall'Alba, D., Fiorini, P., & Mathis-Ullrich, F. (2022). Sim-to-real transfer for visual reinforcement learning of deformable object manipulation for robot-assisted surgery. IEEE Robotics and Automation Letters, 8(2), 560-567.
Jiang, J., Cao, G., Butterworth, A., Do, T. T., & Luo, S. (2022). Where shall I touch? Vision-guided tactile poking for transparent object grasping. IEEE/ASME Transactions on Mechatronics, 28(1), 233-244.
Cheng, H., Wang, Y., & Meng, M. Q. H. (2022). A vision-based robot grasping system. IEEE Sensors Journal, 22(10), 9610-9620.
Hassanin, M., Khan, S., & Tahtali, M. (2021). Visual affordance and function understanding: A survey. ACM Computing Surveys (CSUR), 54(3), 1-35.
Ze, Y., Hansen, N., Chen, Y., Jain, M., & Wang, X. (2023). Visual reinforcement learning with self-supervised 3d representations. IEEE Robotics and Automation Letters, 8(5), 2890-2897.
Orban, G. A., Sepe, A., & Bonini, L. (2021). Parietal maps of visual signals for bodily action planning. Brain Structure and Function, 226(9), 2967-2988.
Song, Y., Wen, J., Liu, D., & Yu, C. (2022). Deep robotic grasping prediction with hierarchical RGB-D fusion. International Journal of Control, Automation and Systems, 20(1), 243-254.
Rolls, E. T., Deco, G., Huang, C. C., & Feng, J. (2023). Multiple cortical visual streams in humans. Cerebral Cortex, 33(7), 3319-3349.
Costanzo, M., De Maria, G., Lettera, G., & Natale, C. (2021). Can robots refill a supermarket shelf?Motion planning and grasp control. IEEE Robotics & Automation Magazine, 28(2), 61-73.
DOI: https://doi.org/10.31449/inf.v49i29.9000
