Document Type: Original Article


1 Management school, Kharazmi University, Tehran, Iran.

2 Shahid Beheshti University, Tehran, Iran.


Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Banks build data sets from the records of previous loan applicants, usually supplement these internal data sets with external credit bureau data, and then use them to estimate PD. Banks also have a continuing interest in using rule-based classifiers to build their default prediction models. In practice, however, the data records are usually incomplete and contain missing values. This causes problems for banks, especially in low-default credit risk portfolios, and complicates rule-based model building. Several strategies can be used to handle the missing data issue. This paper applies five missing value handling strategies to a real credit scoring data set: ignoring, elimination, and replacement with random, mean, or C&R tree induced values. Experimental results show that the ignoring strategy consistently outperforms the other methods on the test data set and suggest that CHAID is a useful classifier for handling low-default portfolios with missing values.
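As a rough illustration of how four of these strategies differ, the following minimal Python sketch applies them to a single numeric attribute. The toy `incomes` list and all function names are hypothetical and not taken from the paper. The sketch assumes that "ignoring" means leaving the gaps untouched for the classifier to handle, and it omits C&R tree induced replacement, which would require fitting a tree on the remaining attributes.

```python
import random
from statistics import mean

# Hypothetical toy attribute: None marks a missing value.
incomes = [42.0, None, 58.0, 61.0, None, 39.0]

def eliminate(values):
    """Elimination: drop the records whose value is missing."""
    return [v for v in values if v is not None]

def replace_with_mean(values):
    """Mean replacement: fill each gap with the mean of the observed values."""
    fill = mean(eliminate(values))
    return [fill if v is None else v for v in values]

def replace_with_random(values, seed=0):
    """Random replacement: fill each gap with a randomly drawn observed value."""
    rng = random.Random(seed)
    observed = eliminate(values)
    return [rng.choice(observed) if v is None else v for v in values]

def ignore(values):
    """Ignoring (assumed meaning): keep the gaps and leave them to the classifier."""
    return list(values)

print(replace_with_mean(incomes))
```

Elimination shrinks the data set, which can be costly in a low-default portfolio where defaulting records are already scarce, while the replacement strategies keep every record at the price of introducing values that were never observed.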

