MATLAB ... and more ...: 2017

Monday, October 30, 2017

Concatenate multiple files with same headers (and only keep one header line in the output file)

awk '
    FNR==1 && NR!=1 { while (/^<header>/) getline; }
    1 {print}
' file*.txt >all.txt

Note: the /^<header>/ part need to be changed to adapt to whatever the actual header is.

Tuesday, October 17, 2017

R - conversion

Don't directly convert factor to numeric! Change to characters first!

df$a = as.numeric(df$a) ---> NOT GOOD!

df$a = as.numeric(as.character(df$a)) ---> GOOD!

Monday, October 9, 2017

convert the first tab into pip

cat SOC.csv | sed -e "s/\t/$(printf '|')/" > SOC.pip

Sanitize U.S. States Names

# load data

StateData = read.csv('65States.pip', sep="|", col.names = c("FullName", "Abbr"))

StateFullName = toupper(StateData$FullName)

StateAbbr = as.vector(StateData$Abbr)

# define a function

sanitizeState = function(inputcol, StateFullName, StateAbbr){

match = amatch(inputcol, StateFullName, maxDist=1)

inputcol[!is.na(match)] = StateAbbr[na.omit(match)]

return (inputcol)

}

# use the function

df$State = sanitizeState(gls$State,StateFullName, StateAbbr )

Wednesday, September 20, 2017

python - read large csv into pandas by chunks

chunks=pd.read_table('filename', chunksize=500000)
df=pd.DataFrame()
df=pd.concat((chunk==1) for chunk in chunks)

To remove duplicated rows:

awk '!seen[$0]++' <filename>

To remove rows with duplicated field (say $1 is ID and need to remove the entire row if ID is duplicated):

awk '!seen[$1]++' <filename>

Tuesday, September 19, 2017

filter a file based on tokens in another file

BEGIN{

FS="|"

OFS="|"

while ((getline < (“Token_list_file.csv")) > 0) {

id[$1]=$1;

}

{

appid = $1;

if(appid in id) {print $0;}

}

Wednesday, August 30, 2017

select records based on a list of indexies

BEGIN{

FS="|"

OFS="|"

while ((getline < ("common_LoanNumber.pip")) > 0) {

id[$1]=$1;

}

{

appid = $1;

if(appid in id) {print $0;}

}

Monday, August 28, 2017

join multiple text files with same headers

If each file has a header line but in the output file you only want to have one header line:

awk '

FNR==1 && NR!=1 { while (/^LoanNumber/) getline; }

1 {print}

' input_file1 input_file2 input_file3 ... > output_file

Friday, August 18, 2017

AWK splice text into chunks

This code slice the first 10K rows into F1, then next 10K rows into F2, and so on

awk 'NR%10000==1{x="F"++i;}{print > x}'  filename

Tuesday, August 15, 2017

plot a ROC curve

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from matplotlib.ticker import AutoMinorLocator

#plt.switch_backend('agg')

dt = pd.read_csv('cs_2017_roc_gain.csv', sep=",", header=0)

fig = plt.gcf()

plt.plot(dt.x_gain * 100.0, dt.y * 100.0)

plt.xlim([0, 100])

plt.ylim([0, 100])

ax = fig.gca()

ax.set_xticks(np.arange(0, 101, 10))

ax.set_yticks(np.arange(0, 101, 10))

minor_locator = AutoMinorLocator(2)

ax.xaxis.set_minor_locator(minor_locator)

ax.yaxis.set_minor_locator(minor_locator)

ax.set_xlabel(r'% of Transactions')

ax.set_ylabel(r'% of Chargeoff Accounts')

fig.suptitle('Consumer')

#ax.minorticks_off()

plt.grid(which='minor')

plt.grid(which='major')

plt.show()

fig.savefig('Consumer_2017.jpg', dpi=300)

print dt.head()

ROC vs Gain

ROC ---->> non-target percent

Gain --->> total-population percent

Saturday, August 12, 2017

2017-08-11

Working on LogisticRegression today:

* Adding class_weight = 'balanced' significantly improves the f-score: the recall is enhanced but precision reduced.

Shiny App Update:

The data originally has 652303 obs. of 299 variables

Frank added these filters:

dt = dtx %>%

filter(coapplicant_first_name == "") %>%

filter(app_scr > 700) %>%

filter(income > 10000) %>%

filter(applicant_age > 30) %>%

filter(time_in_file < 25) %>%

filter(high_credit_amount < 15000) %>%

filter(no_of_trades < 6) %>%

filter(selling_price > 19000)

The resulted file 'High86.rds' has 86 obs. of 299 variables now.

Changed filters again:

dt = dtx %>%

filter(coapplicant_first_name == "") %>%

filter(app_scr > 500) %>%

filter(income > 8000) %>%

filter(applicant_age > 30) %>%

filter(time_in_file < 25,

time_in_file > 0) %>%

filter(high_credit_amount < 15000) %>%

filter(no_of_trades < 6) %>%

filter(selling_price > 15000)

The resulted file 'High61.rds' has 61 obs. of 299 variables now.

Then updated the filters again:

dt = dtx %>%

filter(coapplicant_first_name == "") %>%

filter(app_scr > 300) %>%

filter(income > 8000) %>%

filter(applicant_age > 40) %>%

filter(time_in_file < 36,

time_in_file > 0) %>%

filter(high_credit_amount < 15000) %>%

filter(no_of_trades < 6) %>%

filter(selling_price > 1500) %>%

filter(application_status == 'A')

The resulted file 'High4.rds' only has 4 records.

Then updated again:

dt = dtx %>%

filter(coapplicant_first_name == "") %>%

filter(income > 6000) %>%

filter(app_scr > 400) %>%

filter(str_detect(dealer_scr_reason1, 'Matches to Risky Dealer in Consortium') | str_detect(dealer_scr_reason2, 'Matches to Risky Dealer in Consortium') | str_detect(dealer_scr_reason3, 'Matches to Risky Dealer in Consortium')) %>%

filter(str_detect(empl_name, 'LLC') | str_detect(empl_name, 'venture') | str_detect(empl_name, 'consult') | str_detect(empl_name, 'enterprise'))

The resulted file is 'High42.rds'

Then update:

dt = dtx %>%

filter(income > 8000,

time_in_file > 0,

time_in_file < 36,

applicant_age > 30,

application_status == 'A',

str_detect(addr_city, regex('Miami', ignore_case=T)) | str_detect(addr_city, regex('chicago', ignore_case=T)) | str_detect(addr_city, regex('Houston', ignore_case=T))| str_detect(addr_city, regex('Baltimore', ignore_case=T)) | str_detect(addr_city, regex('Los Angeles', ignore_case=T)) )

The resulted file is 'High21.rds'.

cygwin commands:

# sort

cat all_with_appscr_dlrscr_v2.filtered.pip |sort -t"|" -k2,2n > all_with_appscr_dlrscr_v2.filtered.sorted.pip

# join

join -1 2 -2 1 -t"|" all_with_appscr_dlrscr_v2.filtered.sorted.pip appids2reasons.pip > all_with_appscr_dlrscr_appreasons.pip

Tuesday, August 8, 2017

2017-08-08

Tasks finished:

-- Re-run Joel's Proxy model (folder 09) Random Forest model

cat featuresall.pip | python.exe ./scorerfmodel.py > tagscrall-RF.csv

This won't work because the single-thread computing takes ~500 hours to do the job.

-- Parallelized the Random Forest model by using

1. model.set_params(n_jobs = 30)

2. parse all features and then dump it to the model at once

This is fast -- but dumping all the records took a huge amount of RAM, and during the physical memory is full and HDD swap is used, the speed is slow again.

So the final solution is to cut the input features into about 16 chunks, each has 1 million records, except the last chunk only has about half a million. The process each chunk separately. This way the prediction is still multi-thread and the RAM usage is low.

Initial Test

"Hello World!"

Best Regards,

Shi

MATLAB ... and more ...