Working on LogisticRegression today:
* Setting class_weight='balanced' significantly improves the F-score: recall goes up, but precision goes down.
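For reference, scikit-learn documents the 'balanced' mode as weighting each class inversely to its frequency in the training labels. The symbols below are my own shorthand, not from the notes: n is the number of training samples, K the number of classes, and n_c the number of samples in class c.

```latex
w_c = \frac{n}{K \, n_c}
```

With a rare positive class, n_c is small and w_c is large, so errors on that class cost more in the loss. That is consistent with the higher recall and lower precision noted above.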
Shiny App Update:
The data originally had 652303 obs. of 299 variables.
Frank added these filters:
library(dplyr)  # provides %>% and filter()

dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(app_scr > 700) %>%
  filter(income > 10000) %>%
  filter(applicant_age > 30) %>%
  filter(time_in_file < 25) %>%
  filter(high_credit_amount < 15000) %>%
  filter(no_of_trades < 6) %>%
  filter(selling_price > 19000)
The resulting file 'High86.rds' now has 86 obs. of 299 variables.
Changed filters again:
dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(app_scr > 500) %>%
  filter(income > 8000) %>%
  filter(applicant_age > 30) %>%
  filter(time_in_file < 25,
         time_in_file > 0) %>%
  filter(high_credit_amount < 15000) %>%
  filter(no_of_trades < 6) %>%
  filter(selling_price > 15000)
The resulting file 'High61.rds' now has 61 obs. of 299 variables.
Then updated the filters again:
dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(app_scr > 300) %>%
  filter(income > 8000) %>%
  filter(applicant_age > 40) %>%
  filter(time_in_file < 36,
         time_in_file > 0) %>%
  filter(high_credit_amount < 15000) %>%
  filter(no_of_trades < 6) %>%
  filter(selling_price > 1500) %>%
  filter(application_status == 'A')
The resulting file 'High4.rds' has only 4 records.
Then updated again:
library(stringr)  # provides str_detect()

risky_dealer = 'Matches to Risky Dealer in Consortium'
dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(income > 6000) %>%
  filter(app_scr > 400) %>%
  filter(str_detect(dealer_scr_reason1, risky_dealer) |
         str_detect(dealer_scr_reason2, risky_dealer) |
         str_detect(dealer_scr_reason3, risky_dealer)) %>%
  filter(str_detect(empl_name, 'LLC|venture|consult|enterprise'))
The resulting file is 'High42.rds'.
Then updated again:
dt = dtx %>%
  filter(income > 8000,
         time_in_file > 0,
         time_in_file < 36,
         applicant_age > 30,
         application_status == 'A',
         str_detect(addr_city, regex('Miami|Chicago|Houston|Baltimore|Los Angeles', ignore_case = TRUE)))
The resulting file is 'High21.rds'.
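Cygwin commands:
# sort on the join key (field 2); join expects lexical order on the key, so sort without -n
sort -t"|" -k2,2 all_with_appscr_dlrscr_v2.filtered.pip > all_with_appscr_dlrscr_v2.filtered.sorted.pip
# join field 2 of the sorted file with field 1 of appids2reasons.pip (which must also be sorted on field 1)
join -1 2 -2 1 -t"|" all_with_appscr_dlrscr_v2.filtered.sorted.pip appids2reasons.pip > all_with_appscr_dlrscr_appreasons.pip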
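A minimal sketch of the same sort-then-join pattern on toy data, to show what join expects and what it prints. The file names, columns, and values here are made up for illustration and are not from the real .pip files. join needs both inputs sorted lexically on their join field, in the same locale.

```shell
#!/bin/sh
# Hypothetical sample data; names and values are illustrative only.
tmp=$(mktemp -d)
# left file: score|app_id (join key in field 2)
printf '710|9\n650|10\n580|100\n' > "$tmp/left.pip"
# right file: app_id|reason (join key in field 1)
printf '10|reason B\n100|reason C\n9|reason A\n' > "$tmp/right.pip"

# Sort each file lexically on its join field (no -n), using one locale for both.
LC_ALL=C sort -t'|' -k2,2 "$tmp/left.pip"  > "$tmp/left.sorted.pip"
LC_ALL=C sort -t'|' -k1,1 "$tmp/right.pip" > "$tmp/right.sorted.pip"

# join prints the key first, then the other fields of file 1, then those of file 2.
joined=$(LC_ALL=C join -t'|' -1 2 -2 1 "$tmp/left.sorted.pip" "$tmp/right.sorted.pip")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

Note that the key moves to the first column of the output, and that the keys come out in lexical order (10, 100, 9), not numeric order.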