An End-to-End CRISP-DM Machine Learning Pipeline for Forecasting Demand in FMCG Chain Stores

Document Type: Research Paper

Authors

1 Department of Industrial Management, Faculty of Management, University of Tehran, Tehran, Iran

2 Department of Industrial Management, Faculty of Management and Accounting, Allameh Tabatabai University, Tehran, Iran

3 Faculty of Industrial and Systems Engineering, Tarbiat Modares University, Tehran, Iran

Abstract

Objective: Accurate forecasting of customer demand is essential for improving supply chain efficiency, raising profits by reducing inventory costs, and increasing customer satisfaction. This research presents a new, interpretable machine learning methodology for customer order forecasting based on the CRISP-DM process model. The methodology is validated on a real-world application from the Ofogh Kourosh Company, which operates the largest number of physical retail stores in Iran.
Methods: The dataset contained 844,275 sales transactions from 40 physical stores. Six ensemble machine learning models were developed to forecast customer order demand, and their hyperparameters were tuned automatically with the Optuna framework. Model performance was then evaluated using MAE, MSE, RMSE, and R².
Results: LightGBM was the most accurate model, with an R² of 0.536. Feature importance analysis of the LightGBM model indicated that the three strongest determinants of customer order demand were discount percentage, price, and store location.
Conclusion: This research contributes both theoretically and practically by developing a forecasting model that is regionally, culturally, and contextually relevant to the Iranian retail market. Compared with the existing literature, this study applies ML models to actual transactional data, which narrows the gap between theory and practice. Future research should build on this work by incorporating external factors such as climate, advertising, and macroeconomic conditions to further improve forecast accuracy.
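
The four evaluation metrics named in the Methods can be illustrated with a minimal sketch. It assumes plain Python lists of actual and predicted demand; the function name `regression_metrics` and the toy values are hypothetical and are not taken from the study's data or code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean squared error

    # R2 = 1 - SS_res / SS_tot, the share of variance explained.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined when the actual values are all identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical toy values, for illustration only.
actual = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 11.0, 9.0, 13.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Read this way, an R² of 0.536 means the model's squared error is roughly 46% of that of a baseline that always predicts the mean demand.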
