Wednesday, August 30, 2017

select records based on a list of IDs



BEGIN {
  FS = "|"
  OFS = "|"

  # load the list of loan numbers to keep into the id array
  while ((getline < "common_LoanNumber.pip") > 0) {
    id[$1] = $1
  }
}

{
  # print a record only when its first field matches a loaded loan number
  appid = $1
  if (appid in id) { print $0 }
}
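
For reference, a rough Python equivalent of the same filter (a sketch only: it keeps the common_LoanNumber.pip name from the awk version and assumes the pipe-delimited data comes in on stdin):

import sys

# load the list of loan numbers to keep (first pipe-delimited field of each line)
with open("common_LoanNumber.pip") as f:
    ids = {line.split("|")[0].strip() for line in f if line.strip()}

# keep only the records whose first field is in the list
for line in sys.stdin:
    if line.split("|")[0].strip() in ids:
        sys.stdout.write(line)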

Monday, August 28, 2017

join multiple text files with the same headers

If each file has a header line but you want only one header line in the output file:

awk '
    # at the top of every file except the first, skip the header line(s), assumed to start with LoanNumber
    FNR==1 && NR!=1 { while (/^LoanNumber/) getline; }
    # print everything else
    { print }
'  input_file1 input_file2 input_file3 ... > output_file
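
A rough Python version of the same concatenation, for machines without awk; it assumes each input file has exactly one header line and that the output path comes first on the command line (the script name is a placeholder):

import sys

# usage: python concat_one_header.py output_file input_file1 input_file2 ...
out_path, in_paths = sys.argv[1], sys.argv[2:]

with open(out_path, "w") as out:
    for i, path in enumerate(in_paths):
        with open(path) as f:
            header = f.readline()
            if i == 0:
                out.write(header)  # keep the header from the first file only
            for line in f:
                out.write(line)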

Friday, August 18, 2017

AWK split text into chunks

This code writes the first 10K rows to file F1, the next 10K rows to F2, and so on:
awk 'NR%10000==1{x="F"++i;}{print > x}'  filename
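
A Python sketch of the same chunking, keeping the 10,000-row chunk size and the F1, F2, ... output names from the one-liner (the input file name is a placeholder):

CHUNK_ROWS = 10000

out = None
with open("filename") as f:              # placeholder input file, as in the awk version
    for n, line in enumerate(f):
        if n % CHUNK_ROWS == 0:          # start a new output file every 10,000 rows
            if out:
                out.close()
            out = open("F%d" % (n // CHUNK_ROWS + 1), "w")
        out.write(line)
if out:
    out.close()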

Tuesday, August 15, 2017

plot a ROC curve

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator

#plt.switch_backend('agg')

# x_gain and y are scaled from fractions to percentages for plotting
dt = pd.read_csv('cs_2017_roc_gain.csv', sep=",", header=0)
fig = plt.gcf()
plt.plot(dt.x_gain * 100.0, dt.y * 100.0)
plt.xlim([0, 100])
plt.ylim([0, 100])
ax = fig.gca()
ax.set_xticks(np.arange(0, 101, 10))
ax.set_yticks(np.arange(0, 101, 10))

# one minor tick between each pair of major ticks (every 5%)
minor_locator = AutoMinorLocator(2)
ax.xaxis.set_minor_locator(minor_locator)
ax.yaxis.set_minor_locator(minor_locator)

ax.set_xlabel(r'% of Transactions')
ax.set_ylabel(r'% of Chargeoff Accounts')
fig.suptitle('Consumer')
#ax.minorticks_off()
plt.grid(which='minor')
plt.grid(which='major')

# save before show() so the file is written even after the plot window is closed
fig.savefig('Consumer_2017.jpg', dpi=300)
plt.show()

print(dt.head())

ROC vs Gain

ROC ---->> the x-axis is the cumulative percent of non-target records

Gain --->> the x-axis is the cumulative percent of the total population
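
To make the distinction concrete, a small sketch with synthetic inputs (y is a hypothetical 0/1 target, score a hypothetical model score); after sorting by score, the ROC x-axis accumulates the non-target fraction while the gain x-axis accumulates the total-population fraction:

import numpy as np
import pandas as pd

def roc_and_gain(y, score):
    df = pd.DataFrame({"y": y, "score": score}).sort_values("score", ascending=False)
    cum_target = df["y"].cumsum() / float(df["y"].sum())                 # y-axis for both curves
    cum_nontarget = (1 - df["y"]).cumsum() / float((1 - df["y"]).sum())  # ROC x-axis: % of non-targets
    cum_population = np.arange(1, len(df) + 1) / float(len(df))          # gain x-axis: % of total population
    return cum_nontarget.values, cum_population, cum_target.values

# synthetic example data
roc_x, gain_x, y_pct = roc_and_gain(np.random.randint(0, 2, 1000), np.random.rand(1000))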

Saturday, August 12, 2017

2017-08-11

Working on LogisticRegression today:

* Adding class_weight = 'balanced' significantly improves the f-score: recall is enhanced but precision is reduced (see the sketch below).
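
A minimal sketch of that comparison on synthetic, imbalanced data (the dataset here is a stand-in generated with make_classification, not the project data; everything else is left at the scikit-learn defaults):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# imbalanced stand-in data: roughly 5% positives
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for cw in [None, 'balanced']:
    model = LogisticRegression(class_weight=cw).fit(X_tr, y_tr)
    print('class_weight =', cw)
    print(classification_report(y_te, model.predict(X_te)))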







Shiny App Update:



The data originally has 652303 obs. of 299 variables



Frank added these filters:



dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(app_scr > 700) %>%
  filter(income > 10000) %>%
  filter(applicant_age > 30) %>%
  filter(time_in_file < 25) %>%
  filter(high_credit_amount < 15000) %>%
  filter(no_of_trades < 6) %>%
  filter(selling_price > 19000)



The resulting file 'High86.rds' now has 86 obs. of 299 variables.



Changed filters again:



dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(app_scr > 500) %>%
  filter(income > 8000) %>%
  filter(applicant_age > 30) %>%
  filter(time_in_file < 25,
         time_in_file > 0) %>%
  filter(high_credit_amount < 15000) %>%
  filter(no_of_trades < 6) %>%
  filter(selling_price > 15000)



The resulting file 'High61.rds' now has 61 obs. of 299 variables.





Then updated the filters again:



dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(app_scr > 300) %>%
  filter(income > 8000) %>%
  filter(applicant_age > 40) %>%
  filter(time_in_file < 36,
         time_in_file > 0) %>%
  filter(high_credit_amount < 15000) %>%
  filter(no_of_trades < 6) %>%
  filter(selling_price > 1500) %>%
  filter(application_status == 'A')



The resulting file 'High4.rds' has only 4 records.



Then updated again:



dt = dtx %>%
  filter(coapplicant_first_name == "") %>%
  filter(income > 6000) %>%
  filter(app_scr > 400) %>%
  filter(str_detect(dealer_scr_reason1, 'Matches to Risky Dealer in Consortium') |
         str_detect(dealer_scr_reason2, 'Matches to Risky Dealer in Consortium') |
         str_detect(dealer_scr_reason3, 'Matches to Risky Dealer in Consortium')) %>%
  filter(str_detect(empl_name, 'LLC') |
         str_detect(empl_name, 'venture') |
         str_detect(empl_name, 'consult') |
         str_detect(empl_name, 'enterprise'))



The resulting file is 'High42.rds'.





Then updated:



dt = dtx %>%
  filter(income > 8000,
         time_in_file > 0,
         time_in_file < 36,
         applicant_age > 30,
         application_status == 'A',
         str_detect(addr_city, regex('Miami', ignore_case=T)) |
         str_detect(addr_city, regex('chicago', ignore_case=T)) |
         str_detect(addr_city, regex('Houston', ignore_case=T)) |
         str_detect(addr_city, regex('Baltimore', ignore_case=T)) |
         str_detect(addr_city, regex('Los Angeles', ignore_case=T)))





The resulting file is 'High21.rds'.









cygwin commands:

# sort

cat all_with_appscr_dlrscr_v2.filtered.pip |sort -t"|" -k2,2n > all_with_appscr_dlrscr_v2.filtered.sorted.pip

# join (both input files must be sorted on their join fields)

join -1 2 -2 1 -t"|" all_with_appscr_dlrscr_v2.filtered.sorted.pip appids2reasons.pip > all_with_appscr_dlrscr_appreasons.pip
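
The same join can also be done in pandas; a sketch only, mirroring join -1 2 -2 1 above (field 2 of the first file against field 1 of the second) and assuming the .pip files have no header row:

import pandas as pd

# header=None is an assumption about the .pip files; dtype=str keeps the keys as strings
left = pd.read_csv('all_with_appscr_dlrscr_v2.filtered.pip', sep='|', header=None, dtype=str)
right = pd.read_csv('appids2reasons.pip', sep='|', header=None, dtype=str)

# join field 2 of the left file (index 1) to field 1 of the right file (index 0)
merged = left.merge(right, left_on=1, right_on=0, how='inner')
merged.to_csv('all_with_appscr_dlrscr_appreasons.pip', sep='|', header=False, index=False)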

Tuesday, August 8, 2017

2017-08-08

Tasks finished:

-- Re-ran the Random Forest model from Joel's Proxy model (folder 09):

cat featuresall.pip | python.exe ./scorerfmodel.py >  tagscrall-RF.csv

This approach won't work because single-threaded scoring takes ~500 hours to finish the job.

-- Parallelized the Random Forest scoring by:

1. model.set_params(n_jobs = 30)
2. parsing all the features and then passing them to the model at once

This is fast -- but loading all the records takes a huge amount of RAM, and once physical memory is full and the HDD swap kicks in, the speed becomes slow again.

So the final solution is to cut the input features into about 16 chunks of 1 million records each (the last chunk has only about half a million), then process each chunk separately. This way the prediction is still multi-threaded and the RAM usage stays low; a sketch is below.
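
A rough sketch of that chunked scoring, assuming an already trained scikit-learn random forest loaded from a pickle; the model path, the chunk-size constant, and the all-numeric pipe-delimited feature layout are assumptions, not the actual scorerfmodel.py internals:

import sys
import pickle
import numpy as np

CHUNK_ROWS = 1000000  # roughly 1M records per chunk, as described above

with open('rf_model.pkl', 'rb') as f:   # placeholder path to the trained model
    model = pickle.load(f)
model.set_params(n_jobs=30)             # use many cores for prediction

def score_chunk(rows):
    X = np.array(rows, dtype=float)     # assumes every feature field is numeric
    for p in model.predict_proba(X)[:, 1]:
        sys.stdout.write('%.6f\n' % p)

# stream pipe-delimited feature rows from stdin and score them one chunk at a time
rows = []
for line in sys.stdin:
    rows.append(line.rstrip('\n').split('|'))
    if len(rows) == CHUNK_ROWS:
        score_chunk(rows)
        rows = []
if rows:
    score_chunk(rows)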


 


Initial Test

"Hello World!"

