This assessment relates to the following module learning outcomes:
A. Knowledge and Understanding
A1. Understand the potential of KDD and data mining for developing scorecards.
B. Subject Specific Intellectual and Research Skills
B1. Work with software to develop credit scoring solutions; develop a scorecard using data mining techniques.
C. Transferable and Generic Skills
C1. Critically analyse practical difficulties that arise when implementing scorecards; understand the cross-fertilisation potential to other business contexts (e.g. fraud detection, CRM).
Coursework Brief:
Question 1 (65 marks)
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This requires banks to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years.
Historical data (cs-training.csv) are provided on 150,000 borrowers. The following variables are available to you:
Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (excluding real estate and instalment debt such as car loans) divided by the sum of credit limits | percentage
age | Age of borrower in years | integer
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer
DebtRatio | Monthly debt payments, alimony and living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (instalment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) | integer
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | integer
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer
NumberOfDependents | Number of dependents in family, excluding themselves (spouse, children etc.) | integer
The goal of Question 1 is to build a model from the training dataset that banks can use to help make the best financial decisions on borrowers in the testing dataset (cs-test.csv).
1.1 Carefully pre-process the dataset by considering the following activities:
• Exploratory data analysis.
• Missing value handling (if any). Marks will be deducted for simply replacing missing values with a single value; a proper study of the missing values is required.
• Outlier detection and treatment (if any). Marks will be deducted for simply eliminating or replacing outliers without justification; a proper study of the outliers is required.
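As an illustration only, and not a required approach, the kind of study asked for above can be sketched in Python with pandas. The toy data frame is invented for the example, and the helper names (missing_summary, missing_vs_target, iqr_outlier_mask) are hypothetical. The 1.5 x IQR rule is just one common heuristic, so any treatment applied to the real data still needs its own justification.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": 100.0 * n_missing / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)


def missing_vs_target(df: pd.DataFrame, column: str,
                      target: str = "SeriousDlqin2yrs") -> pd.Series:
    """Default rate for rows where `column` is missing versus present.

    A clear gap between the two rates suggests the missingness itself
    carries information and should not be silently overwritten.
    """
    rates = df[target].groupby(df[column].isna()).mean()
    return rates.rename(index={False: "present", True: "missing"})


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented toy data that reuses three column names from the brief.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0, 0, 1, 0, 1, 0, 0, 1],
    "age": [45, 38, 52, 0, 61, 29, 47, 33],
    "MonthlyIncome": [5200.0, np.nan, 3100.0, 4800.0,
                      np.nan, 6100.0, 250000.0, 2900.0],
})

print(missing_summary(toy))
print(missing_vs_target(toy, "MonthlyIncome"))
print(toy[iqr_outlier_mask(toy["MonthlyIncome"])])  # candidate income outliers
print(toy[iqr_outlier_mask(toy["age"])])            # implausible ages
```

Flagged rows are candidates for investigation rather than automatic deletion: an age of 0 is almost certainly a data error, whereas a very high income may be genuine.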
1.2 Build a credit scoring model in which SeriousDlqin2yrs is used as a target (default) and report the following:
• What method do you use?
• Why do you use this method?
• Discuss your results.
• The most important variables
• The impact of the variables on the target
• The performance of the model. Use various performance metrics and discuss their relationship if any.
• What do banks win and lose by doing this?
In terms of software, use SAS Enterprise Miner or anything else (e.g., Python, R and so on). Carefully report the various steps of your methodology and discuss your results in a rigorous way!
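To make the request for "various performance metrics" and "their relationship" concrete, the following is a minimal sketch using NumPy only. The labels and scores are invented toy values, not results from cs-training.csv, and the function names are hypothetical. It computes the AUC, the Gini coefficient (which equals 2 x AUC - 1), the Kolmogorov-Smirnov statistic, and the confusion counts at an assumed cutoff of 0.5.

```python
import numpy as np


def roc_auc(y_true, scores) -> float:
    """AUC via the Mann-Whitney rank-sum formula (ties get average ranks)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # replace tied ranks by their mean
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return float((ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))


def ks_statistic(y_true, scores) -> float:
    """Largest gap between the cumulative score distributions of the two classes."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    bad, good = scores[y_true == 1], scores[y_true == 0]
    thresholds = np.unique(scores)
    cdf_bad = np.array([(bad <= t).mean() for t in thresholds])
    cdf_good = np.array([(good <= t).mean() for t in thresholds])
    return float(np.max(np.abs(cdf_bad - cdf_good)))


def confusion_at_cutoff(y_true, scores, cutoff: float) -> dict:
    """Counts when every applicant scoring at or above `cutoff` is rejected."""
    y_true = np.asarray(y_true)
    reject = np.asarray(scores, dtype=float) >= cutoff
    return {
        "tp": int(np.sum(reject & (y_true == 1))),   # defaulters rejected
        "fp": int(np.sum(reject & (y_true == 0))),   # good borrowers rejected
        "fn": int(np.sum(~reject & (y_true == 1))),  # defaulters accepted
        "tn": int(np.sum(~reject & (y_true == 0))),  # good borrowers accepted
    }


# Invented toy example: 1 = default, scores = predicted probability of default.
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

auc = roc_auc(y, p)
gini = 2 * auc - 1          # Gini is a linear rescaling of the AUC
ks = ks_statistic(y, p)
counts = confusion_at_cutoff(y, p, cutoff=0.5)
print(f"AUC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}  counts={counts}")
```

The confusion counts link directly to the question of what banks win and lose: false negatives are accepted borrowers who default (credit losses), while false positives are good borrowers turned away (lost business). Moving the cutoff trades one against the other.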
Question 2 (35 marks)
Find an academic or business paper published in 2019 or later discussing a real-life application of data mining or credit scoring. It is important that the case considered is a real-life case and not an artificial one. Some suggested journals and conferences are:
• Management Science
• Operations Research
• INFORMS Journal on Computing
• INFORMS Journal on Applied Analytics
• Journal of Machine Learning Research
• European Journal of Operational Research
• ICDM (IEEE International Conference on Data Mining)
• NeurIPS (Conference on Neural Information Processing Systems)
• KDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
Once you have found an appropriate paper, report the following in separate sections:
• Title, authors and complete citation (journal name, book title, issue, year, …)
• The data mining problem considered
• The data mining techniques used
• The results reported
• A critical discussion of the model and results (assumptions made, shortcomings, limitations, …)
Make sure you demonstrate that you understand what the article is all about!
Word Limit: A margin of 10% either side of the word count (see above) is deemed to be acceptable. Any text that exceeds the additional 10% will not attract any marks. The relevant word count includes items such as cover page, executive summary, title page, table of contents, tables, figures, in-text citations and section headings, if used. The relevant word count excludes your list of references and any appendices at the end of your coursework submission.
You should always include the word count (from Microsoft Word, not Turnitin), at the end of your coursework submission, before your list of references.
Title/Cover Page: You must include a title/cover page that includes your Student ID, Module Code, Assignment Title and Word Count. This assignment will be marked anonymously, so please ensure that your name does not appear on any part of your assignment.
References: You should use the Harvard style to reference your assignment. The library provides guidance on how to reference in the Harvard style.