Optimization and Validation of Two Machine Learning Algorithms for Accurate Prediction of Irrigated Wheat (Triticum aestivum L.) Yield and Identification of its Influential Factors in Khorasan Razavi Province

Document Type : Research Article

Author

Department of Agrotechnology, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran

Abstract

Introduction
This study compared two supervised machine-learning algorithms, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), for predicting irrigated wheat (Triticum aestivum L.) yield across 20 counties in Razavi Khorasan Province. Both models were trained on 70% of the dataset (Solar Hijri years 1383–1402) and tested on the remaining 30%. Hyperparameters were tuned by grid search coupled with five-fold cross-validation.
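The tuning procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the data are synthetic, the five predictors are hypothetical, the grid values are invented for the example, and a random 70/30 split is assumed because the abstract does not say how the years were partitioned.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for one county: 20 years x 5 hypothetical climate predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = 3000 + 400 * X[:, 0] - 250 * X[:, 1] + rng.normal(scale=100, size=20)

# 70/30 split: 14 training years and 6 test years (random split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative grid only; the grid actually searched in the study is not given.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Grid search with five-fold cross-validation, scored by (negative) RMSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# RMSE of the refitted best model on the held-out 30%.
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
```

The same pattern would apply to any regressor that exposes the scikit-learn estimator interface, with a learning-rate entry added to the grid for a boosting model.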
Materials and Methods
Both algorithms were trained and optimized on 70% of the data (approximately 14 years per county) and then tested on the remaining 30% (approximately 6 years). Hyperparameters were tuned by grid search over key parameters (e.g., number of trees n_estimators, maximum depth max_depth, and learning rate for XGBoost) with five-fold cross-validation. Both models were then evaluated and validated using RMSE (Root Mean Squared Error), R² (Coefficient of Determination), MAE (Mean Absolute Error), and Willmott’s d (Index of Agreement).
Random Forest and XGBoost were developed to predict wheat yield at the county level. Yield was first predicted numerically by regression and then divided into three classes using statistical percentiles: Low (0th to 33rd percentile), Medium (33rd to 66th percentile), and High (66th to 100th percentile). The classification was applied to both the actual and the predicted values for each county, and the results are presented as a confusion matrix for each model.
Using three indices (Aridity Index, Seasonal Intensity of Temperature, and Growing Degree Days), the counties of the province were clustered into three climatic zones, and wheat yield in these zones was then analyzed with both models.
The Random Forest and XGBoost algorithms were implemented in Python 3.3 using the Scikit-learn library. DataFrame preparation and the clustering of the province’s counties into climatic zones were carried out using R 4.3.2 and ArcGIS 10.8.2.
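The four evaluation metrics named in this section can be written out in a few lines of plain Python. The sketch below uses the standard definitions; R² is taken as 1 - SS_res/SS_tot, which is an assumption, since the abstract does not say whether that form or the squared correlation was used. The yield values are made up for illustration.

```python
import math

def rmse(obs, pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean Absolute Error."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)

def r_squared(obs, pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot

def willmott_d(obs, pred):
    """Willmott's index of agreement (1 means perfect agreement)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
              for o, p in zip(obs, pred))
    return 1 - num / den

# Made-up observed and predicted yields (kg ha-1) for illustration only.
observed = [3000, 3500, 4000, 4500]
predicted = [3100, 3400, 4200, 4300]

scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r_squared(observed, predicted),
    "d": willmott_d(observed, predicted),
}
```

RMSE and MAE are in the units of yield (kg ha-1), while R² and d are dimensionless.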
Results and Discussion
Algorithm Performance: On average, RF reduced prediction error by ~19% (RMSE of 395 vs. 492 kg ha-1) and achieved slightly higher agreement with observed yields (d = 0.467 vs. 0.419). RF performed best for Khalilabad (RMSE = 141.19 kg ha-1, MAE = 125.68 kg ha-1, d = 0.37) and Bardeskan (RMSE = 187.69 kg ha-1, d = 0.80), and worst for Quchan (RMSE = 667.65 kg ha-1, d = 0.36) and Torbat-e Heydarieh (RMSE = 581.33 kg ha-1, d = 0.47). XGBoost performed best for Torbat-e Jam (RMSE = 269.36 kg ha-1, d = 0.77) and Neyshabur (RMSE = 377.91 kg ha-1, d = 0.78), and worst for Quchan (RMSE = 943.99 kg ha-1, d = 0.25) and Torbat-e Heydarieh (RMSE = 786.20 kg ha-1, d = 0.39). In 6 of the 20 counties (Khaf, Mahvelat, Kalat-e Nader, Kashmar, Chenaran, Fariman), RF and XGBoost produced nearly identical errors (ΔRMSE < 15 kg ha-1), indicating similar predictive power under those local conditions.
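The average error reduction and the ΔRMSE criterion can be checked with simple arithmetic. The sketch below uses only the RMSE values quoted in this paragraph; the function and variable names are illustrative, and only the two counties for which both models' RMSE is reported are included.

```python
def relative_reduction(baseline_rmse, model_rmse):
    """Fractional reduction in RMSE relative to the baseline model."""
    return (baseline_rmse - model_rmse) / baseline_rmse

# Province-wide averages quoted above: XGBoost 492, RF 395 kg ha-1.
avg_reduction = relative_reduction(492.0, 395.0)

# (RF RMSE, XGBoost RMSE) in kg ha-1 for counties where both are reported.
rmse_by_county = {
    "Quchan": (667.65, 943.99),
    "Torbat-e Heydarieh": (581.33, 786.20),
}

# Counties where the two models differ by less than 15 kg ha-1.
near_identical = [
    county
    for county, (rf, xgb) in rmse_by_county.items()
    if abs(rf - xgb) < 15
]
```

With the rounded averages, the reduction evaluates to roughly 0.197. Neither of the two counties listed here meets the ΔRMSE < 15 kg ha-1 criterion.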
Feature Importance: Daily minimum temperature (Tmin) ranked #1 in RF’s importance list and #3 in XGBoost’s. Seasonal Tmin (TminGS) was consistently #3 in both models. Among the other key predictors, precipitation over the growing season (Prec, PGS), Growing Degree Days (GDDGS), and evapotranspiration (ETGS) all contributed substantially, occupying ranks 4 to 8, though their importance was markedly lower in RF than in XGBoost.
Classification of Yield into Three Performance Classes: Using percentile thresholds (Low: 0–33rd, Medium: 34–66th, High: 67–100th), the models were also evaluated as classifiers. Low-performing counties needing intervention: seven counties (35% of the sample), namely Quchan, Torbat-e Heydarieh, Sarakhs, Kalat-e Nader, Gonabad, Neyshabur, and Taybad, fell in the Low performance class under both models. These areas should be prioritized for targeted agronomic management and resource allocation.
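The percentile-based classification and the resulting confusion matrix can be sketched in plain Python. This is an illustration under assumptions: the tercile cut-offs are computed from the observed yields and applied to both observed and predicted values, which the abstract does not state explicitly, and the yield values are made up.

```python
from statistics import quantiles

LABELS = ("Low", "Medium", "High")

def tercile_cutoffs(values):
    """Return the two cut points that split the values into three classes."""
    low_cut, high_cut = quantiles(values, n=3)
    return low_cut, high_cut

def classify(value, cutoffs):
    """Assign Low, Medium, or High according to the tercile cut points."""
    low_cut, high_cut = cutoffs
    if value <= low_cut:
        return "Low"
    if value <= high_cut:
        return "Medium"
    return "High"

def confusion_matrix(observed, predicted, cutoffs):
    """Rows are actual classes, columns are predicted classes."""
    matrix = {actual: {pred: 0 for pred in LABELS} for actual in LABELS}
    for obs, pred in zip(observed, predicted):
        matrix[classify(obs, cutoffs)][classify(pred, cutoffs)] += 1
    return matrix

# Made-up yields (kg ha-1) for illustration only.
observed = [2000, 2500, 3000, 3500, 4000, 4500]
predicted = [2200, 2900, 3100, 3900, 3700, 4600]

cutoffs = tercile_cutoffs(observed)  # assumption: thresholds from observed values
matrix = confusion_matrix(observed, predicted, cutoffs)
accuracy = sum(matrix[label][label] for label in LABELS) / len(observed)
```

Diagonal entries count cases where the predicted class matches the actual class, so the share of the diagonal gives the overall classification accuracy.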
Cluster-Based Insights: Counties were grouped into three agro-climatic clusters. Very Dry & Hot (4 counties: Bardeskan, Khalilabad, Mahvelat, Sarakhs): RF outperformed XGBoost in all four. Semi-humid Cooler (6 counties): results were mixed; RF won in 4 (Chenaran, Kalat-e Nader, Mashhad, Neyshabur), and XGBoost was slightly better in 2 (Fariman, Quchan). Warm Semi-arid (10 counties): RF was superior in 7, the two models were equivalent in Taybad and Torbat-e Heydarieh, and XGBoost did not edge ahead in any county.
Conclusion
Overall, the Random Forest model predicted wheat yield in Razavi Khorasan Province more accurately than XGBoost in most counties. Although XGBoost has higher potential for modeling complex patterns, Random Forest was more accurate under conditions of greater data dispersion. The results of this study emphasize that, by using climatic and agricultural data, machine learning algorithms can be optimized not only to achieve high accuracy but also, with the SHAP tool, to provide a clear interpretation of each variable’s contribution to wheat yield.

Keywords


Authors retain the copyright. This is an open access article distributed under Creative Commons Attribution 4.0 International License (CC BY 4.0)
