We tackle the problem of localizing traffic surveillance cameras in cooperative perception. To overcome the lack of large-scale real-world intersection datasets, we introduce Carla Intersection, a new simulated dataset covering 75 urban and rural intersections in Carla. We further introduce TrafficLoc, a novel neural network that localizes traffic cameras within a 3D reference map using a coarse-to-fine matching pipeline. For image-point cloud feature fusion, we propose a novel Geometry-guided Attention Loss that addresses cross-modal viewpoint inconsistencies. During coarse matching, we propose an Inter-Intra Contrastive Learning scheme that achieves precise alignment while preserving the distinctiveness of local intra-modal features within each image patch-point group pair. In addition, we introduce Dense Training Alignment with a soft-argmax operator so that all fine features contribute when regressing the final position. Extensive experiments show that TrafficLoc improves localization accuracy over state-of-the-art image-to-point-cloud registration methods by a large margin (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new state-of-the-art performance on the KITTI and nuScenes datasets, demonstrating strong localization ability across both in-vehicle and traffic cameras.
Taking a 3D point cloud and a 2D image as input, TrafficLoc first extracts features at the point-group and image-patch level. The Geometry-guided Feature Fusion (GFF) module strengthens these features, which are then matched by feature similarity. Fine features are extracted around the coarse matches, and fine matching is performed between each point-group center and its extracted image window using a soft-argmax operation. The resulting 2D-3D correspondences are used to estimate the camera pose with the RANSAC+EPnP algorithm.
Figure 1: Pipeline of our proposed TrafficLoc for relocalization
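The fine-matching step can be made differentiable with a soft-argmax over the fine similarity map. Below is a minimal PyTorch sketch; the tensor shapes and temperature are our assumptions rather than the paper's exact head, but it illustrates how every cell of the similarity map contributes to the regressed position, which is the point of the Dense Training Alignment.

```python
import torch

def soft_argmax_2d(similarity: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Regress sub-pixel 2D locations from (B, H, W) similarity maps.

    Unlike a hard argmax, the softmax weighting is differentiable and lets
    every fine feature receive gradients during training.
    """
    b, h, w = similarity.shape
    weights = torch.softmax(similarity.view(b, -1) / temperature, dim=-1).view(b, h, w)
    ys = torch.arange(h, dtype=similarity.dtype, device=similarity.device)
    xs = torch.arange(w, dtype=similarity.dtype, device=similarity.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    x = (weights * grid_x).sum(dim=(1, 2))
    y = (weights * grid_y).sum(dim=(1, 2))
    return torch.stack([x, y], dim=-1)  # (B, 2) expected pixel coordinates
```

Given the resulting 2D-3D correspondences, the pose can be recovered with OpenCV's RANSAC+EPnP, a standard pairing (the reprojection threshold below is an assumed value):

```python
import cv2
import numpy as np

# pts3d: (N, 3) point-group centers, pts2d: (N, 2) matched pixel locations,
# K: (3, 3) camera intrinsics; zero lens distortion assumed here.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
    flags=cv2.SOLVEPNP_EPNP, reprojectionError=8.0)
```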
Our Geometry-guided Feature Fusion (GFF) module first applies multiple self- and cross-attention layers to enhance features across the two modalities (left). The Geometry-guided Attention Loss (GAL) is then applied to the cross-attention map of the last fusion layer, with supervision derived from camera projection geometry (right).
Figure 2: Pipeline of Geometry-guided Feature Fusion (GFF) module
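This page does not spell out GAL's exact form, so the following is only a hedged sketch of one plausible instantiation: each point group's P2I attention row, treated as a probability distribution over image patches, is pushed toward the patch that its projected 3D center falls into. The function name, shapes, and cross-entropy choice are our assumptions.

```python
import torch
import torch.nn.functional as F

def gal_p2i_sketch(attn: torch.Tensor, proj_uv: torch.Tensor,
                   patch_grid: tuple, patch_size: int) -> torch.Tensor:
    """attn:       (G, P) P2I attention of the last fusion layer (rows sum to 1)
    proj_uv:    (G, 2) pixel projections of the G point-group centers
    patch_grid: (Hp, Wp) number of patches per axis, with P = Hp * Wp."""
    hp, wp = patch_grid
    col = (proj_uv[:, 0] / patch_size).floor().long()
    row = (proj_uv[:, 1] / patch_size).floor().long()
    visible = (col >= 0) & (col < wp) & (row >= 0) & (row < hp)
    target = (row * wp + col)[visible]           # patch index per visible group
    log_attn = torch.log(attn[visible] + 1e-8)   # attention rows as probabilities
    return F.nll_loss(log_attn, target)          # cross-entropy toward the target patch
```

An analogous term for the I2P direction would supervise each patch's attention row with the point groups lying along its camera ray.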
We set up a new simulated intersection dataset, Carla Intersection, to study the traffic camera localization problem in varying environments. It comprises 75 intersections across 8 worlds in the Carla simulation environment, spanning urban and rural landscapes. An on-board LiDAR sensor captures point cloud scans, which are accumulated and downsampled to form the 3D point cloud of each intersection. For each intersection, we captured 768 training images and 288 testing images with known 6-DoF poses, at a resolution of 1920×1080 pixels and a horizontal field of view (FOV) of 90°. Reflecting real-world traffic surveillance installations, the cameras are mounted at heights of 6 to 8 meters with pitch angles of 15 to 30 degrees, a typical setup for capturing optimal traffic views under varied monitoring conditions. More details of our dataset are in the paper.
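Under a standard pinhole model with square pixels (Carla's default camera), these capture settings fully determine the intrinsics, which is useful when reproducing projections on this dataset:

```python
import math
import numpy as np

# Intrinsics implied by the 1920x1080 resolution and 90° horizontal FOV.
W, H, hfov_deg = 1920, 1080, 90.0
fx = W / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))  # = 960.0
K = np.array([[fx,  0.0, W / 2.0],
              [0.0, fx,  H / 2.0],
              [0.0, 0.0, 1.0]])
```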
Qualitative localization results on the proposed Carla Intersection dataset and the KITTI Odometry dataset. The point cloud is projected into a 2D view and shown above the image, with point colors indicating distance. TrafficLoc achieves better performance, with more correct (green) and fewer incorrect (red) point-to-pixel pairs. The first column shows the input point cloud and input image.
Figure 3: Qualitative results of our TrafficLoc and other baseline methods. (a), (b) and (c) show results on the Carla Intersection Dataset. (d) shows results on the KITTI Odometry dataset.
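The overlays in this figure boil down to projecting the point cloud through the estimated pose. A minimal sketch using a hypothetical helper, with R and t mapping world to camera coordinates:

```python
import numpy as np

def project_points(pts_w: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray):
    """Project (N, 3) world points into the image plane.

    Returns pixel coordinates and per-point depth; depth drives the
    distance coloring, and comparing projections under the ground-truth
    vs. estimated pose yields the green/red pair classification."""
    pts_c = pts_w @ R.T + t            # world -> camera frame
    in_front = pts_c[:, 2] > 1e-6      # keep points in front of the camera
    uvw = pts_c[in_front] @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective division
    return uv, pts_c[in_front, 2]
```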
We visualize the cross-attention maps between the two modalities. With the Geometry-guided Attention Loss (GAL), the P2I attention map of a point group P3D concentrates on the image region where the point group projects, while the I2P attention map of a patch I2D assigns greater weight to the area traversed by that patch's camera ray. Both observations highlight the geometry-awareness of our proposed GAL.
Figure 4: Visualization of the P2I and I2P attention maps with and without the Geometry-guided Attention Loss (GAL). Red indicates high attention values; blue indicates low values.
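The camera-ray behaviour of the I2P attention can be checked geometrically: the points a patch should attend to are those near the viewing ray through that patch's center. A hedged sketch, where the radius and helper are our assumptions:

```python
import numpy as np

def points_near_ray(pts_w, uv, K, R, t, radius=1.0):
    """Mask of 3D points within `radius` meters of the viewing ray through
    pixel `uv`; this is the region the I2P attention is expected to highlight."""
    cam_center = -R.T @ t                         # camera origin in world frame
    ray_dir = R.T @ np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    ray_dir /= np.linalg.norm(ray_dir)
    rel = pts_w - cam_center
    along = rel @ ray_dir                         # signed distance along the ray
    perp = np.linalg.norm(rel - np.outer(along, ray_dir), axis=1)
    return (along > 0) & (perp < radius)          # in front of camera and near the ray
```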
@misc{xia2024trafficloclocalizingtrafficsurveillance,
  title={TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes},
  author={Yan Xia and Yunxiang Lu and Rui Song and Oussema Dhaouadi and João F. Henriques and Daniel Cremers},
  year={2024},
  eprint={2412.10308},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.10308},
}