This paper proposes a novel approach to improving the performance of semantic segmentation algorithms in remote sensing applications by combining the strengths of Swin Transformers and convolutional neural networks (CNNs). First, Swin Transformers are used to extract detailed information from the image, while CNNs capture the spatial context. Second, a cross-scale multi-level fusion module is introduced to integrate the outputs of the two branches. Finally, an up-sampling module is designed to further enhance the resolution of the segmented image. Experimental results show that this method outperforms existing methods on the Cityscapes dataset. This article was authored by Jing Chen, Min Xia, DHail1, and others.
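The general idea of cross-scale fusion can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, shapes, and the nearest-neighbour up-sampling plus per-pixel linear mix (a stand-in for a 1x1 convolution) are all illustrative assumptions.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour up-sampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_cross_scale(fine, coarse, weights):
    """Fuse a fine-scale map (C1, H, W) with a coarse-scale map (C2, H/2, W/2).

    The coarse map is up-sampled to the fine resolution, concatenated along
    the channel axis, and mixed per pixel by `weights` of shape
    (C_out, C1 + C2), which plays the role of a 1x1 convolution kernel.
    """
    up = upsample_nearest(coarse, 2)              # (C2, H, W)
    stacked = np.concatenate([fine, up], axis=0)  # (C1 + C2, H, W)
    c, h, w = stacked.shape
    return (weights @ stacked.reshape(c, h * w)).reshape(-1, h, w)

# Toy example: fuse a 4x4 fine map with a 2x2 coarse map into 2 output channels.
fine = np.ones((3, 4, 4))       # e.g. a detail branch output
coarse = np.ones((2, 2, 2))     # e.g. a context branch output
w = np.full((2, 5), 0.2)        # averages the 5 concatenated channels
fused = fuse_cross_scale(fine, coarse, w)
print(fused.shape)  # (2, 4, 4)
```

In a real network the 1x1 mix would be a learned convolution and the up-sampling might be bilinear or transposed-convolutional; the sketch only shows how features from two scales are brought to a common resolution before integration.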