GF-UNet: A Cropland Extraction Method Based on Attention Gates and Adaptive Feature Fusion



Introduction
Cropland plays a crucial role in modern agricultural development and is vital to the survival of human society [1]. Timely and accurate access to agricultural information is of great importance for ensuring national food security and promoting sustainable development of the national economy [2]. Remote sensing technology, with its wide coverage and timely imaging capabilities, enables rapid updates of agricultural information [3,4].
As the spatial resolution of remote sensing imagery continues to improve, ground objects can be represented with greater accuracy. In particular, very high resolution (VHR) remote sensing imagery, with a resolution finer than 5.0 meters [5], can effectively capture the shape and type of land objects, providing accurate data for precise crop monitoring [2,6].
Traditional cropland extraction methods, including K-Means [7], Support Vector Machine (SVM) [8], Decision Tree (DT) [1], and Random Forest (RF) [9], rely mainly on the spectral, textural, and geometric characteristics of the image to derive cropland information [10]. However, their results are susceptible to the salt-and-pepper phenomenon, which reduces accuracy [11]. In addition, these techniques underutilize high-level image features such as morphology and context, so the extraction results may not meet practical requirements [1,10].
Convolutional Neural Networks (CNNs) have emerged as a highly effective deep learning architecture for semantic segmentation tasks, including the extraction of ground object information from remote sensing images [12]. Many researchers have used CNNs to extract features such as buildings, roads, and cropland because CNNs learn abstract features and capture contextual associations in images without relying on hand-crafted features [13,14,15]. For example, Liu et al. [16] used U-Net to identify cropland, effectively mitigating the salt-and-pepper noise associated with traditional methods. Similarly, Du et al. [17] applied DeepLabV3+ to segment irregular small cropland plots.
However, encoder-decoder networks such as U-Net [18], PSPNet [19], and DeepLabV3+ [17] have a potential disadvantage: they may reintroduce irrelevant information that the deeper layers have already filtered out. This problem arises because the decoder's upsampling results are refined with low-level encoder features passed through skip connections, which may degrade the model's segmentation performance [20]. To solve this problem, Oktay et al. [21] proposed the Attention U-Net, which incorporates attention gates (AGs) into the U-Net architecture. The Attention U-Net assigns different weights to the skip-connection features, suppressing irrelevant areas and highlighting the features that are most useful for the specific segmentation task, which improves segmentation accuracy. However, the Attention U-Net does not fully take into account the local spatial details in the shallow features or the relationship between global and local contextual features. It also overlooks the importance of spatial detail and location information [22].
These limitations highlight the need for further improvements in the modeling of spatial detail and the integration of local and global contextual features. Addressing these challenges can improve the accuracy of cropland extraction models, enabling more accurate identification and delineation of cropland areas in remotely sensed imagery.
Furthermore, the inclusion of multi-scale features is crucial for improving the accuracy of semantic segmentation. Researchers have explored different approaches to incorporating multi-scale information into their models.
Yang et al. [23] used parallel and cascaded architectures of dilated convolutions to design the DenseASPP module, allowing the model to learn more global features. Liu et al. [24] constructed a new residual ASPP to obtain essential multi-scale semantic information while avoiding the vanishing gradient problem.
However, a common challenge with these methods is that they use all channel information from the input features for the feature scale transformation. While this enables multi-scale feature fusion, it can increase the computational burden of the model and introduce redundant information. To address this issue, researchers have explored techniques to optimize multi-scale feature fusion, such as channel attention and feature recalibration mechanisms that selectively emphasize relevant information and suppress redundant or less informative features. In this way, models can fuse multi-scale features more efficiently and effectively, leading to improved segmentation accuracy.
Based on the research mentioned above, we have developed a novel deep-learning approach for the precise extraction of cropland from very high-resolution (VHR) images. Our approach employs a CNN model that integrates attention gates and multi-scale feature fusion. The main contributions of our approach are as follows:
(1) We propose the GF-UNet model for extracting cropland from VHR images. GF-UNet includes an adaptive feature fusion module (AFFM) and a spatial feature extraction module (SFEM) to improve feature recognition and detail extraction capabilities.
(2) We collected and processed GF-2 satellite data and used data augmentation to expand the number of samples, ensuring an adequate dataset for the experiments.
(3) To evaluate the effectiveness of the proposed model, we conducted a comparative analysis with several popular semantic segmentation models, analyzing the experimental results both quantitatively and qualitatively to verify the advantages of our method.
This paper is organized as follows: Section I gives the introduction, Section II describes the related work, Section III describes the data processing and the proposed methods, Section IV analyzes the experimental results, and Section V discusses the reasons for the performance differences between the models. Finally, Section VI concludes the paper.

Related work
Land cover information plays a crucial role in the advancement of agricultural remote sensing. Many scientists have made significant contributions to research on land cover information extraction and land cover mapping.
Hong et al. [25] introduced a farmland boundary extraction technique that systematically combines several computational and mathematical methods, including the Suzuki85 algorithm, Canny edge detection, and the Hough transform, to extract farmland distribution information in six South Korean regions. This algorithm extracts boundaries with 80.7% accuracy, 79.7% completeness, and 67.0% quality, allowing for the automatic creation of farm maps.
Graesser et al. [26] developed a method for cropland area extraction that combines multispectral image edge extraction, multi-scale contrast-limited adaptive histogram equalization, and adaptive threshold segmentation. The study focused on extracting farmland distribution information from portions of South America, and the extracted results had an F1 score of 91%. The approach is well suited to extracting cropland distribution data over large areas and allows for accurate monitoring of agricultural developments.
Zhang et al. [27] proposed a general method for high-resolution cropland mapping using deep convolutional neural networks. Their method used the Pyramid Scene Parsing Network (PSPNet), slightly modified to combine deep long-range features with local shallow features. This combination enabled more detailed predictions and improved accuracy in cropland mapping. The modified network, MPSPNet, was evaluated using high-resolution satellite imagery in four different research areas in China and achieved an overall accuracy of 89.99% in validation.
In the study [28], the authors introduced a multi-scale fusion network for cropland extraction that incorporates an attention mechanism. This method uses an image gradient attention guide module to improve the accuracy of the extracted cropland information. To ensure comprehensive and complete cropland information extraction, the authors also incorporated a multi-scale spatial feature consensus fusion model into the network. The experimental results demonstrate that this method efficiently extracts cropland boundaries as well as the semantic information related to cropland.
In the study [29], Xu et al. introduced a multi-task cascade network model called SGENet for the extraction of farmland plot information. This model was designed to automatically learn multi-scale and multi-level features, enabling it to handle complex planting scenarios and different scales of farmland plots.
In the study [30], Huan et al. proposed a multi-attention encoder-decoder network (MAENet) for the segmentation of agricultural scenes. To improve segmentation performance, the network incorporates several modules, including the dual-pooling efficient channel attention (DPECA) module, the dual-feature attention (DFA) module, and the global-guidance information upsampling (GIU) module. The authors evaluated MAENet on three self-built farmland image datasets of UAV data. MAENet achieved an MIoU of 93.74% and a Kappa score of 96.74%, outperforming other existing methods. The research discussed in this section is summarized in Table 1.
The GF-UNet model is a cropland extraction network that improves on the standard U-Net architecture by incorporating AGs to distinguish between different categories with similar features [31]. This study modifies the Attention U-Net architecture to create GF-UNet, with the aim of enhancing the accuracy and efficiency of cropland extraction. GF-UNet was evaluated on a self-built dataset that includes a variety of complex cultivated land scenarios. It achieved a precision of 91.25% and an F1-score of 92.41%, indicating that the model can accurately extract cropland regions from input imagery.

| Research works | Summary of methods | Limitation |
|---|---|---|
| Development of a Parcel-Level Land Boundary Extraction Algorithm for Aerial Imagery of Regularly Arranged Agricultural Areas [25] | An algorithm for extracting farmland boundaries that combines computational and mathematical techniques, including the Suzuki85 algorithm, Canny edge detection, and the Hough transform. | The algorithm may not be suitable when the cropland deviates significantly from a regular shape. |
| Detection of cropland field parcels from Landsat imagery [26] | A cropland area extraction method that combines multispectral image edge extraction, multi-scale contrast-limited adaptive histogram equalization, and adaptive threshold segmentation. | The method relies on prior knowledge of the scene, which may limit its applicability for extracting cropland information over large areas. |
| A generalized approach based on convolutional neural networks for large area cropland mapping at very high resolution [27] | A method for high-resolution cropland mapping using deep convolutional neural networks. | Applying the method to scenes with complex land cover patterns or a mix of different surface types may pose challenges. |
| Study of Multiscale Fused Extraction of Cropland Plots in Remote Sensing Images Based on Attention Mechanism [28] | A multi-scale fusion cropland extraction network that incorporates an attention mechanism. | The method may produce cropland boundaries with some degree of fuzziness, and parcels may be connected in some instances. |
| Extraction of cropland field parcels with high resolution remote sensing using multi-task learning [ | | |

Materials and proposed method

Data preprocessing
The experimental data were obtained from the High-Resolution Earth Observation System Data and Application Center of Hubei Province (http://datasearch.hbeos.org.cn). They comprise 16 high-quality GF-2 satellite scenes with minimal cloud cover (<5%) covering Xuan'en County, Hubei Province, in central China (29°01'-33°06'N, 108°21'-116°07'E). First, we used ENVI 5.3 to preprocess the collected GF-2 remote sensing data with orthorectification, atmospheric correction, radiometric correction, and multispectral and panchromatic band fusion [32]. The atmospheric correction was performed using FLAASH [33]. The panchromatic image was sharpened using a sharpening filter before image fusion. The acquired multispectral and panchromatic images were then fused at a 4:1 ratio, and the fused GF-2 image had a spatial resolution of 1 meter.
Cropland samples were mapped based on the pre-processed GF-2 images and the collected cropland images. The GF-2 images and the mapped cropland samples were cropped to 256×256 pixels and divided into training data, validation data, and test data.

| Type | Characteristics |
|---|---|
| Type 1 | The texture within the cropland appears uniform, and the boundary of the cropland is clearly defined. |
| Type 2 | The boundary between large croplands is easily distinguishable, but the features of small croplands vary. |
| Type 3 | The characteristics of cropland closely resemble those of the surrounding background. |
| Type 4 | The cropland has an irregular shape, and its boundaries are interconnected. |

Network structure
To accurately extract the cropland distribution information, this paper proposes the GF-UNet network model, whose structure is shown in Figure 1.
GF-UNet adds the AFFM and SFEM to Attention U-Net and mainly consists of four parts: the encoder, the decoder, the SFEM, and the AGs. Similar to Attention U-Net, each of the first three layers of the GF-UNet encoder consists of two convolutional layers (Conv) with a batch normalization layer (BN) and a rectified linear unit (ReLU) in series. The fourth layer consists of AFFM. AFFM extracts the global semantic information of the farmland by fusing multi-scale farmland features, and it uses the squeeze and excitation (SE) channel attention mechanism [34] to obtain the channel weight distribution of the fused features, which enhances the ability of the network to recognize farmland attributes. To extract the spatial detail of the low-level features and preserve its location information, GF-UNet adds SFEM to the skip connection. The output of SFEM is then used as the input of the AGs to increase the responsiveness of the network to cropland features. Finally, the cropland distribution information is output through the convolution module in the decoder.

Adaptive feature fusion module
In the task of semantic segmentation, it is crucial to integrate multi-scale information due to variations in segmented objects. Relying solely on a single scale of features often leads to inadequate extraction outcomes [35]. This paper proposes an Adaptive Feature Fusion Module (AFFM), which comprises a parallel multi-branch network consisting of a multi-scale feature fusion module and an attention enhancement module.
AFFM effectively captures both the global contextual features of the field and the primary semantic information of the cropland. It accomplishes this through its multi-scale feature fusion module, which comprehensively learns field features, along with an attention enhancement module that learns the channel weight distribution for cropland features while reducing redundant information during network training. Figure 2 illustrates the structure of AFFM.
AFFM divides the input feature $X$ into four sub-features, $X_1$, $X_2$, $X_3$, and $X_4$, in channel order. To capture feature information at different scales, $X_1$, $X_2$, and $X_3$ are pooled: $X_1$ is downsampled by a factor of 8, $X_2$ by a factor of 4, and $X_3$ by a factor of 2. Pooling reduces the spatial resolution of the features while preserving their essential information. $X_4$ is not pooled and therefore retains its original spatial resolution, which lets the network capture fine-grained information and maintain the location information of the features.

To capture global contextual information and achieve a broader receptive field, a depthwise separable convolution with a 3×3 kernel is applied to each of the four sub-features. $X_1$, $X_2$, and $X_3$ are then upsampled to match the spatial resolution of the input feature $X$, giving four feature maps at different scales: $Y_1$, $Y_2$, $Y_3$, and $Y_4$. These maps are concatenated along the channel dimension, and a 1×1 convolution is applied to the concatenated map so that channel information can interact across scales. A sigmoid activation then yields the spatial attention weight, which is multiplied element-wise with the input feature $X$ to give the spatially adaptive feature map $Y_s$. The multiplication emphasizes the features that the attention weight marks as important and downweights the rest:

$$Y_i = \mathrm{UP}(\mathrm{DWConv}(\mathrm{Pool}(X_i))), \quad i = 1, 2, 3$$
$$Y_4 = \mathrm{DWConv}(X_4)$$
$$Y_s = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{concat}(Y_1, Y_2, Y_3, Y_4))) \otimes X$$

where UP is upsampling, DWConv denotes depthwise separable convolution, Pool is the pooling operation, $\mathrm{Conv}_{1 \times 1}$ is the 1×1 convolution, concat is concatenation along the channel dimension, $\sigma$ denotes the sigmoid function, and $\otimes$ denotes pixel-wise multiplication.

The channel weights for the spatially adaptive feature map $Y_s$ are determined using global average pooling (GAP), a fully connected layer (FC), and a sigmoid activation function. Multiplying these channel weights with $Y_s$ redistributes the weights of $Y_s$ and yields the adaptive feature $Y$. This adaptive feature captures the refined contribution of each channel, ultimately enhancing the discrimination of features.
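The channel reweighting step described above (GAP, then FC, then sigmoid, then multiplication with Ys) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the nested-list layout, and the toy values are hypothetical, and in a real network `fc_weights` and `fc_bias` would be learned during training.

```python
import math


def global_average_pool(feature):
    """Reduce each H x W channel of a C x H x W nested list to its mean."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def fully_connected(vector, weights, bias):
    """Dense layer: weights has one row per output unit."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def channel_reweight(feature, fc_weights, fc_bias):
    """Return (reweighted feature, channel weights) via GAP -> FC -> sigmoid."""
    squeezed = global_average_pool(feature)
    channel_weights = [
        sigmoid(z) for z in fully_connected(squeezed, fc_weights, fc_bias)
    ]
    reweighted = [
        [[weight * value for value in row] for row in channel]
        for channel, weight in zip(feature, channel_weights)
    ]
    return reweighted, channel_weights


# Toy 2-channel, 2 x 2 feature map standing in for Ys (hypothetical values).
ys = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
# Hypothetical FC parameters; all zeros make every channel weight exactly 0.5.
y, weights = channel_reweight(ys, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

With non-zero FC parameters the weights differ per channel, which is what lets the module emphasize informative channels and damp redundant ones.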

Spatial feature extraction module
The attention mechanism, inspired by human vision, is effective in focusing on important detailed features during network training, thus improving network performance [36]. The CBAM attention mechanism, which includes both channel and spatial attention, enhances the network's learning capability [37].
To obtain spatial attention, CBAM calculates spatial attention weights by performing average pooling and maximum pooling along the channel dimension. These operations extract the maximum and average values across the channels at each spatial location. However, the pooling operation can result in the loss of channel information, which can be detrimental to the effective transfer of information during network training.
154 Informatica 48 (2024) 149-162 C. Li et al.
Retaining channel information is crucial for preserving the discriminative power of features. Without channel information, the network may struggle to capture the fine-grained details necessary for accurate segmentation. Therefore, it is important to address the potential loss of channel information when using spatial attention mechanisms like CBAM.
To tackle the potential loss of channel information caused by pooling in the CBAM attention mechanism, SFEM removes CBAM's pooling layer along the channel dimension. Instead, it uses two layers of 7×7 depthwise separable convolutions to enlarge the receptive field and capture more global spatial feature information. Depthwise separable convolutions reduce computational complexity while remaining effective at capturing spatial features.
To help the network use the input features fully, a 1×1 convolution enables the interaction of channel information by expanding and reducing the number of feature channels.
With these modifications, SFEM improves the contextualization of spatial features. It contributes to the detailed restoration of cropland extraction results by preserving the overall structure and details of the image. Figure 3 depicts the structure of SFEM, showing the arrangement of the components that capture global spatial feature information and promote the interaction of channel information.
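The claim that depthwise separable convolutions reduce computational cost can be checked with simple weight counts. The sketch below compares one standard 7×7 convolution with one 7×7 depthwise separable convolution. The 64-channel width is a hypothetical value chosen for illustration (the channel widths of SFEM are not stated here), and biases are ignored.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard convolution: each output channel sees every input channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """One k x k depthwise filter per input channel, then a 1 x 1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


channels = 64  # hypothetical channel width, not taken from the paper
standard = standard_conv_params(7, channels, channels)
separable = depthwise_separable_params(7, channels, channels)
print(standard, separable)
```

Under these assumptions the standard layer needs 200704 weights and the separable layer 7232, which is why a large 7×7 kernel stays affordable in the separable form.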

Attention gates
Attention Gates (AGs) identify salient feature regions using the attention coefficient $\alpha \in [0,1]$ and suppress the responses of irrelevant features, maximizing the retention and activation of neurons associated only with salient features [38]. The structure of AGs can be seen in Figure 4.
In AGs, the input feature $x_l$ and the gate signal $x_g$ are first transformed by the 1×1 convolutions $W_l$ and $W_g$, respectively. The two results are then fused to obtain the feature map $x_t$, which combines the salient information from both $x_l$ and $x_g$. Next, $x_t$ is passed through the ReLU activation function, and a 1×1 convolution $\psi$ reduces the number of feature channels while preserving important information, giving $x_q$.
Finally, the sigmoid activation function is applied to $x_q$ to obtain the attention coefficients $\alpha$. These coefficients are resampled and multiplied with the input features $x_l$ to obtain $x'_l$. This process enhances the representation of salient features while suppressing irrelevant feature regions.
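The attention gate steps described above can be traced on a toy example. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the two projected features are fused by element-wise addition (as in the original Attention U-Net) and that the input feature and gate signal already share one spatial size, so the resampling step is omitted. All names, weights, and values are hypothetical.

```python
import math


def conv1x1(matrix, pixel):
    """A 1 x 1 convolution acts on each pixel as a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, pixel)) for row in matrix]


def attention_gate(x_l, x_g, w_l, w_g, psi):
    """Gate x_l with coefficients computed from x_l and the gate signal x_g.

    x_l and x_g are H x W grids of per-pixel channel vectors of equal size.
    """
    gated, alphas = [], []
    for row_l, row_g in zip(x_l, x_g):
        gated_row, alpha_row = [], []
        for pixel_l, pixel_g in zip(row_l, row_g):
            # Fuse the two projections; element-wise addition is an assumption.
            x_t = [
                a + b
                for a, b in zip(conv1x1(w_l, pixel_l), conv1x1(w_g, pixel_g))
            ]
            activated = [max(0.0, v) for v in x_t]  # ReLU
            # 1 x 1 convolution down to a single channel.
            x_q = sum(p * v for p, v in zip(psi, activated))
            alpha = 1.0 / (1.0 + math.exp(-x_q))  # sigmoid keeps alpha in (0, 1)
            alpha_row.append(alpha)
            gated_row.append([alpha * v for v in pixel_l])
        gated.append(gated_row)
        alphas.append(alpha_row)
    return gated, alphas


# Toy 1 x 2 grid with 2 channels per pixel (hypothetical values and weights).
x_l = [[[1.0, 2.0], [3.0, 4.0]]]
x_g = [[[1.0, 0.0], [0.0, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
gated, alphas = attention_gate(x_l, x_g, identity, identity, psi=[0.0, 0.0])
```

With `psi` set to zeros every coefficient is 0.5, so the gate simply halves the input; learned weights instead push the coefficient toward 1 on salient pixels and toward 0 elsewhere.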

Loss function
In remote sensing images, the imbalance between cultivated and non-cultivated land is particularly noticeable in mountainous areas. Binary cross entropy loss (BCE Loss) gives equal weight to each class, which can mislead training when one class dominates [39]. Dice Loss [40], on the other hand, is better at extracting the foreground and is more suitable for unbalanced samples. However, Dice Loss suffers from gradient instability, which can result in suboptimal convergence [41]. Therefore, the BCE-Dice Loss function, which combines Dice Loss and BCE Loss, is more suitable for measuring the agreement between predicted and actual values in cultivated land extraction. In the standard unweighted form of this combination, it is calculated as follows:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
$$L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i}$$
$$L_{BCE\text{-}Dice} = L_{BCE} + L_{Dice}$$

where $y_i$ represents the true value of the $i$-th pixel, $\hat{y}_i$ represents the predicted value of the $i$-th pixel, and $N$ is the number of pixels.
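To make the combined loss concrete, the following plain-Python sketch evaluates a BCE-Dice loss on a handful of pixels. It is an illustration under assumptions rather than the authors' training code: the two terms are summed with equal weight, and a small `eps` (a hypothetical stabilizer, not mentioned in the text) guards the logarithm and the division.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over N pixels; eps keeps log() finite."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def dice_loss(y_true, y_pred, eps=1e-7):
    """One minus the Dice coefficient; eps avoids division by zero on empty masks."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    denominator = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def bce_dice_loss(y_true, y_pred):
    """Unweighted sum of the two terms (the equal weighting is an assumption)."""
    return bce_loss(y_true, y_pred) + dice_loss(y_true, y_pred)


labels = [1.0, 0.0, 1.0, 0.0]
uncertain = [0.5, 0.5, 0.5, 0.5]      # model has learned nothing yet
confident = [0.99, 0.01, 0.99, 0.01]  # model is nearly right everywhere
loss_uncertain = bce_dice_loss(labels, uncertain)
loss_confident = bce_dice_loss(labels, confident)
print(loss_uncertain, loss_confident)
```

The uncertain prediction scores about ln 2 + 0.5, while the confident one is close to zero, so both terms reward the same improvement but with differently shaped gradients.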

Performance assessment
To evaluate the results of cropland extraction, four evaluation metrics based on the confusion matrix were used: Precision, Recall, F1-score, and Intersection over Union (IoU). Precision is the ratio of true positive predictions to all positive predictions, Recall is the ratio of true positive predictions to all actual positive samples, F1-score is the harmonic mean of Precision and Recall, and IoU is the ratio of the intersection to the union of the predicted and true values. The formulas for calculating these metrics are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where TP is true positive, TN is true negative, FN is false negative, and FP is false positive.
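The four metrics can be computed directly from pixel-wise confusion counts. The sketch below is a small plain-Python illustration; the function names and the six-pixel example are hypothetical, and it assumes at least one positive pixel in both the prediction and the label so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = cropland, 0 = background)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn):
    """Precision, Recall, F1-score, and IoU; assumes non-zero denominators."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Six hypothetical pixels: three hits, one false alarm, one miss, one true negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = segmentation_metrics(tp, fp, fn)
print(metrics)
```

Note that TN does not enter any of the four formulas, which is why these metrics stay informative when background pixels greatly outnumber cropland pixels.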

Result analysis
The experiment is based on the TensorFlow 2.6 deep learning framework and uses Python 3.6 to execute the code.The computer hardware uses an Intel Core i7-11700F CPU, an NVIDIA GeForce RTX 3060 graphics card with 12 GB of video memory, and CUDA 11.2 accelerated computing.The computer's operating system is Windows 10.
The Adam optimizer was selected for training the model. Adam is an abbreviation for Adaptive Moment Estimation and is commonly used in deep learning tasks. It combines the advantages of adaptive learning rate methods and momentum-based methods, which helps to alleviate the issues caused by gradient oscillation during training.
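How Adam combines momentum with a per-parameter adaptive step can be seen in a single-parameter sketch. The code below follows the standard Adam update rule in plain Python; it is not the TensorFlow implementation used in the experiments. The decay rates `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumptions here, as is the toy quadratic objective.

```python
import math


def adam_step(param, grad, m, v, t, lr=5e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad         # momentum: running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def minimize_quadratic(steps, lr=5e-3):
    """Minimize the toy objective f(w) = (w - 3)^2 starting from w = 0."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (w - 3.0)
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w


w_after_one = minimize_quadratic(1)
w_after_many = minimize_quadratic(500)
print(w_after_one, w_after_many)
```

The first step has a size of almost exactly `lr` regardless of the gradient magnitude, which illustrates the scale normalization that makes Adam less sensitive to oscillating gradients.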
To examine the effect of batch size and learning rate on training and validation accuracy, we conducted various experiments with different parameter values.The results were analyzed and presented in Figure 5 and Figure 6, which display the training and validation accuracy curves, respectively.
Figure 5 illustrates the training accuracy curve over the course of training, while Figure 6 shows the validation accuracy curve. These curves show how batch size and learning rate affect the model's performance. As the learning rate increases, training accuracy improves gradually, whereas validation accuracy first increases and then decreases. A higher learning rate may therefore speed up convergence and raise training accuracy, but it can also cause overfitting and lower validation accuracy. When the learning rate is less than 5e-3, the validation accuracy exceeds the training accuracy, which suggests that the network is underfitting, that is, it has not yet fully learned the training data. For the batch size, a value of 12 maximizes both the training and validation accuracy and balances computational efficiency against model performance for this task. Based on these observations, the initial learning rate was set to 5e-3 and the batch size to 12.

The significant improvement of GF-UNet over classical models highlights the effectiveness of the proposed architecture for remote sensing image segmentation. This improvement can be attributed to the design choices and architectural modifications in GF-UNet, which enhance its ability to segment remote sensing images accurately.
Additionally, GF-UNet outperforms two recent methods, MAENet and SGENet. The proposed architecture therefore surpasses not only traditional models but also more contemporary approaches in remote sensing image segmentation.
The high Precision, F1-score, and IoU values achieved by GF-UNet demonstrate its ability to identify positive samples accurately, to balance precision and recall, and to capture the overlap between predicted and true positive areas. These metrics highlight the robustness and quality of the segmentation results produced by GF-UNet.
Figure 7 displays the semantic segmentation results of the different methods. The first column presents the cropland image, the second column shows the cropland label, and the remaining columns show the segmentation results of the various methods.
The results in Figure 7 allow a qualitative comparison of the methods. GF-UNet's segmentation results show clear edges, accurate details, and the best overall result among the compared networks. U-Net, PSPNet, and DeepLabV3+, on the other hand, tend to miss small-area targets, resulting in poor segmentation. This qualitative analysis supports the effectiveness of GF-UNet.

Discussion
The spectral and textural characteristics of cropland vary due to complex cropland types, diverse crop varieties, and different phenological characteristics [42]. Additionally, VHR remote sensing imagery commonly exhibits both homogeneity and heterogeneity, further complicating the extraction of cropland characteristics [43]. This paper validates the proposed model using a self-built cropland dataset. Data preprocessing and data augmentation are used to increase the number of samples, and the dataset includes four different scenarios to capture the variability present in cropland imagery. The proposed model is compared with other state-of-the-art (SOTA) models to assess its effectiveness.
Table 3 provides a comprehensive comparison of the performance metrics of the different models. GF-UNet stands out with a precision of 0.9125, outperforming MAENet and SGENet by 2.5% and 1.11%, respectively, indicating its superior ability to identify positive samples accurately. SGENet achieves the highest recall of 0.9442, ahead of GF-UNet by 0.85%. While SGENet performs slightly better at capturing all true positive samples, the margin is small. The F1 score, which balances precision and recall, highlights the advantage of GF-UNet over Attention U-Net and SGENet, with an F1 score of 0.9241. In addition, GF-UNet achieves the highest Intersection over Union (IoU) value of 0.8456, outperforming MAENet and SGENet by 3.31% and 3.69%, respectively. The IoU metric indicates the accuracy and quality of the segmentation results. Both the F1 score and IoU are critical evaluation metrics for network models, and GF-UNet is the top performer in both, which indicates strong overall performance and a small disparity between predicted and actual results.
To investigate the factors that enhance the performance of the proposed model, ablation experiments were conducted using Attention U-Net as the base network to analyze the impact of AFFM and SFEM on model performance. The ablation effect was evaluated using IoU as the index. The evaluation indexes of the ablation experiments are presented in Table 4, and the ablation results are shown in Figure 8.
Table 4 shows that adding the SFEM module to Attention U-Net increases the model's IoU by 1.78%, to 81.24%. Similarly, adding the AFFM module to Attention U-Net increases the model's IoU by 3.4%, to 82.86%. Both figures imply the same Attention U-Net baseline IoU of 79.46% (81.24 − 1.78 = 82.86 − 3.40 = 79.46). The two modules were added separately to Attention U-Net, and each improves the model's performance.
Combined with the results in Figure 8, it is evident that the SFEM module improves the clarity of the cropland edge features extracted by the model, especially in areas where farmland and buildings are connected. It allows for more precise delineation of the boundaries between cropland and other structures, which is particularly noticeable in regions where cropland and buildings are adjacent or overlapping. The SFEM module refines the segmentation results by emphasizing distinctive features associated with cropland edges, resulting in clearer and more accurate boundary delineation.
The AFFM module improves the model's ability to distinguish between farmland and forest land with similar features. It integrates multi-scale features, enlarges the model's receptive field, and extracts richer semantic information. Additionally, the SE channel attention mechanism strengthens the model's learning ability and improves its performance.
Our proposed model uses both SFEM and AFFM, which raises the IoU to 84.56%. By combining the advantages of SFEM and AFFM, our model not only enhances its learning ability but also retains more detailed information. Compared to other state-of-the-art models, our model achieves higher segmentation accuracy and more precise edge features.
However, during the experiment, we found that our model tended to overlook small, isolated areas of farmland, resulting in omissions in the segmentation. The extraction of fragmented, irregularly shaped farmland with connected edges still needs to be improved. Despite this, our model effectively enhances feature discrimination and preserves details. It can be applied to semantic segmentation in complex scenarios, including but not limited to the extraction of farmland features.

Conclusion
Because of the distinct spectral and textural features of cropland in high-resolution remote sensing images, classical semantic segmentation networks yield inaccurate and incomplete results for cropland extraction. To address this issue, the GF-UNet network based on the Attention U-Net architecture is proposed. In GF-UNet, attention gates (AGs) are employed to enhance the discriminative ability between cropland areas and non-cropland features in complex scenes. To improve the network's ability to extract different categories of cropland attributes, we use an adaptive feature fusion module (AFFM). Additionally, we add a spatial feature extraction module (SFEM) to the skip connections to refine the detailed features extracted from intermediate layers. Our method is evaluated using GF-2 images captured in Xuan'en County, Hubei Province. The experimental results show that GF-UNet achieves an F1 score of 92.41% and an IoU of 84.56%. Our proposed approach provides more accurate and comprehensive extraction of cropland information compared to SOTA methods. In the future, we will focus on incorporating phenological characteristics specific to different crops to improve categorization accuracy, considering the significant influence of crop types and phenological characteristics on cropland dynamics over time.
Data augmentation operations such as rotation and noise addition were performed on the training and validation data to increase the number of samples, which helps to avoid overfitting caused by insufficient training data. The dataset for this study consisted of 9863 training samples, 1932 validation samples, and 1860 test samples, with an approximate ratio of 8:1:1. It included four types of croplands, and their characteristics are summarized in Table 2.
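The two augmentation operations named above, rotation and noise addition, can be sketched on a tiny image tile. This is a plain-Python illustration under assumptions, not the pipeline used for the dataset: the 90-degree rotation steps, the Gaussian noise model, the `sigma` value, and the [0, 1] pixel range are all hypothetical choices. The label mask is rotated together with the image but is never given noise.

```python
import random


def rotate90(grid):
    """Rotate a 2-D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    return [
        [min(1.0, max(0.0, value + rng.gauss(0.0, sigma))) for value in row]
        for row in image
    ]


def augment(image, mask, sigma=0.02, seed=0):
    """Return the original pair plus three rotations and one noisy copy."""
    rng = random.Random(seed)
    samples = [(image, mask)]
    rotated_image, rotated_mask = image, mask
    for _ in range(3):
        rotated_image = rotate90(rotated_image)
        rotated_mask = rotate90(rotated_mask)
        samples.append((rotated_image, rotated_mask))
    samples.append((add_gaussian_noise(image, sigma, rng), mask))
    return samples


# Hypothetical 2 x 2 tile and its cropland mask (1 = cropland).
tile = [[0.1, 0.2], [0.3, 0.4]]
label = [[1, 0], [0, 0]]
augmented = augment(tile, label)
print(len(augmented))
```

Each input tile yields five samples in this sketch; a real pipeline would apply the same idea to full 256×256 tiles with all spectral bands.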

Figure 1: Architecture of GF-UNet.
FC represents the fully connected layer, GAP represents the global average pooling, σ1 represents the ReLU activation function, and σ2 represents the Sigmoid activation function.

Figure 5: The influence curve of learning rate on training accuracy and validation accuracy.
Figure 6:

Figure 7: Visualization of cropland extraction results by multiple methods.

Table 2: Cropland type characteristics in dataset.

Table 3: Results of evaluation indicators of multiple methods.