Spatiotemporal fusion-based multimodal road feature extraction for 3D visual perception
Abstract
Three-dimensional visual perception, a core technology of intelligent driving systems, constructs geometrically and semantically rich vectorized scene representations by fusing multimodal sensor data, including LiDAR point clouds, camera images, and radar signals. This paper proposes a spatiotemporal fusion-based multimodal road feature parsing framework that combines transformer architectures with bird's-eye view (BEV) representation learning to build a road feature extraction system. The system employs a multi-scale feature pyramid to extract features from heterogeneous sensor data and uses attention mechanisms to align multi-perspective features and transform them into BEV space. Furthermore, a spatiotemporal fusion method is introduced to adaptively integrate multi-frame observations, improving detection accuracy and recall. The framework can be applied in offline automated annotation pipelines to generate training ground truth for onboard online perception models. Experimental results on our proprietary autonomous driving dataset demonstrate that the framework achieves higher precision and recall in lane marking and road boundary detection than conventional approaches.
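As a rough illustration of the attention-based BEV transformation and adaptive multi-frame fusion summarized above, the following PyTorch sketch shows one possible realization: learnable BEV queries cross-attend to flattened multi-view image features, and a per-cell gate blends the current BEV grid with the previous frame's BEV grid. All module names, dimensions, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BEVTemporalFusion(nn.Module):
    """Illustrative sketch (not the paper's code): BEV queries attend to
    multi-view image features, then a gated update adaptively fuses the
    current BEV grid with the previous frame's BEV grid."""

    def __init__(self, embed_dim=256, bev_h=50, bev_w=50, num_heads=8):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # One learnable query per BEV cell (hypothetical design choice).
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
        # Cross-attention from BEV queries to flattened multi-view features.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Gate deciding, per cell, how much of the previous BEV to keep.
        self.temporal_gate = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid()
        )

    def forward(self, img_feats, prev_bev=None):
        # img_feats: (B, N_tokens, C) image features flattened over cameras/scales.
        b = img_feats.shape[0]
        queries = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.cross_attn(queries, img_feats, img_feats)  # (B, H*W, C)
        if prev_bev is not None:
            # Adaptive multi-frame fusion: gate blends current and previous
            # BEV features (ego-motion alignment omitted in this sketch).
            gate = self.temporal_gate(torch.cat([bev, prev_bev], dim=-1))
            bev = gate * bev + (1.0 - gate) * prev_bev
        return bev  # reshape to (B, C, H, W) downstream for road-feature heads


if __name__ == "__main__":
    model = BEVTemporalFusion()
    feats_t0 = torch.randn(2, 1000, 256)  # dummy multi-view features, frame t-1
    feats_t1 = torch.randn(2, 1000, 256)  # dummy multi-view features, frame t
    bev_t0 = model(feats_t0)
    bev_t1 = model(feats_t1, prev_bev=bev_t0)
    print(bev_t1.shape)  # torch.Size([2, 2500, 256])
```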