Abstract:
Three-dimensional scene understanding of remote sensing images can be decoupled into two pixel-level tasks: semantic segmentation and height estimation. Existing methods typically address these tasks independently, neglecting their inherent correlations and thus failing to fully leverage complementary information to improve multi-task performance. This paper proposes a unified multi-task learning network for joint semantic segmentation and height estimation from monocular remote sensing images. The network consists of: a shared backbone that extracts the feature information required by both tasks; a context recombination module that adaptively recombines high-level semantic information according to object-scale patterns to enrich feature representations; a cross-task gating interaction module that alleviates semantic ambiguities caused by similar spectral signatures through feature interaction between the tasks; and a multi-task decoder that outputs the final segmentation map and height estimate. The proposed method achieves superior performance on the ISPRS Vaihingen test set: an overall accuracy of 91.16% and a mean intersection over union (mIoU) of 82.88% for semantic segmentation, and a relative error of 0.261 and a root mean square error of 1.283 for height estimation. Multi-task performance is significantly improved, yielding more complete object boundaries and smoother height regression results.
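The shared-backbone, two-head structure described above can be sketched minimally as follows. This is an illustrative NumPy sketch only, not the paper's implementation: the tensor shapes, the class count, and the 1x1 channel-mixing "heads" (`W_seg`, `W_height`) are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a B x C x H x W feature map from a shared backbone.
B, C, H, W = 1, 16, 8, 8
NUM_CLASSES = 6  # e.g. the six ISPRS Vaihingen categories (assumed)

features = rng.standard_normal((B, C, H, W))

# Task-specific 1x1 convolution heads modeled as channel-mixing matrices
# (placeholder random weights, for shape illustration only).
W_seg = 0.1 * rng.standard_normal((NUM_CLASSES, C))
W_height = 0.1 * rng.standard_normal((1, C))

# Segmentation head: per-pixel class logits -> label map via argmax.
seg_logits = np.einsum('kc,bchw->bkhw', W_seg, features)
seg_map = seg_logits.argmax(axis=1)  # B x H x W integer labels

# Height head: per-pixel scalar regression, ReLU keeps heights non-negative.
height = np.maximum(np.einsum('kc,bchw->bkhw', W_height, features), 0.0)[:, 0]

print(seg_map.shape, height.shape)  # (1, 8, 8) (1, 8, 8)
```

Both heads read the same shared features, which is what makes cross-task interaction (e.g. gated feature exchange between the segmentation and height branches) possible in the full network.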