Tuesday, March 3, 2020

Building an XGBoost model

1. pre-processing:

           1.1 Boruta to clean the data column-wise (feature selection)
           1.2 TomekLinks to clean the data row-wise (see the sketch of both steps below)
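
A minimal sketch of step 1, assuming a feature DataFrame `X` and a binary label Series `y` already exist (hypothetical names), with BorutaPy from the boruta package and TomekLinks from imbalanced-learn:

```
from boruta import BorutaPy
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import RandomForestClassifier

# 1.1 column-wise: Boruta keeps only features that beat their shadow copies
rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta.fit(X.values, y.values)        # BorutaPy expects numpy arrays
X_sel = X.loc[:, boruta.support_]     # keep only the confirmed columns

# 1.2 row-wise: drop majority-class samples that form Tomek links
X_clean, y_clean = TomekLinks().fit_resample(X_sel, y)
```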

2. training (use a random sample to do this step if the data size is too big):

          2.1 create a baseline model with a fixed learning rate and n_estimators:

                                       XGBClassifier(objective='binary:logistic',
                                                     n_estimators=X,
                                                     learning_rate=0.1,
                                                     n_jobs=-1)
         
          2.2 grid search for the optimal 'max_depth' and 'min_child_weight':
                          -- use scoring='roc_auc'
                          -- use cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3)

          2.3 take gridsearch.best_estimator_ as 'current_best', then refit current_best (see the sketch below).
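
A sketch of steps 2.2-2.3, reusing `X_clean`/`y_clean` from the step-1 sketch; the grid values and `n_estimators=100` are illustrative assumptions, not recommendations:

```
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from xgboost import XGBClassifier

base = XGBClassifier(objective='binary:logistic', n_estimators=100,
                     learning_rate=0.1, n_jobs=-1)
param_grid = {'max_depth': [3, 5, 7, 9], 'min_child_weight': [1, 3, 5]}
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3)

gridsearch = GridSearchCV(base, param_grid, scoring='roc_auc', cv=cv)
gridsearch.fit(X_clean, y_clean)

current_best = gridsearch.best_estimator_   # 2.3: carry the winner forward
current_best.fit(X_clean, y_clean)          # refit on the full search data
```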

         2.4 then grid search for 'gamma' with current_best; when it's done, take gridsearch.best_estimator_ as the new current_best and refit it.

         2.5 then grid search for 'subsample' and 'colsample_bytree'; same pattern: take gridsearch.best_estimator_ as the new current_best and refit it.

         2.6 then grid search for 'learning_rate'; same pattern again (all three stages are sketched as a loop below).
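
Since 2.4-2.6 repeat one search-then-refit pattern, they can be written as a loop. The candidate values below are illustrative assumptions; GridSearchCV clones its estimator, so each stage keeps the parameters already tuned into current_best:

```
stages = [
    {'gamma': [0, 0.1, 0.2, 0.5]},                        # 2.4
    {'subsample': [0.6, 0.8, 1.0],
     'colsample_bytree': [0.6, 0.8, 1.0]},                # 2.5
    {'learning_rate': [0.01, 0.05, 0.1, 0.2]},            # 2.6
]
for grid in stages:
    gridsearch = GridSearchCV(current_best, grid, scoring='roc_auc', cv=cv)
    gridsearch.fit(X_clean, y_clean)
    current_best = gridsearch.best_estimator_
    current_best.fit(X_clean, y_clean)
```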


3. final training: use all the data to fit the model with all the optimized params.

4. Evaluation.
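
A sketch of steps 3-4, assuming the full training data is `X_train`/`y_train` (the complete set, not the random sample from step 2) and a held-out `X_test`/`y_test` split exists (hypothetical names):

```
from sklearn.metrics import roc_auc_score, classification_report

current_best.fit(X_train, y_train)                 # step 3: fit on all the data
proba = current_best.predict_proba(X_test)[:, 1]   # step 4: evaluate
print('ROC AUC:', roc_auc_score(y_test, proba))
print(classification_report(y_test, current_best.predict(X_test)))
```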
                 
               

my-alpine and docker-compose.yml

docker-compose.yml:

```
version: '3'
services:
  man:
    build: .
    image: my-alpine:latest
```

Dockerfile:

```
FROM alpine:latest
ENV PYTH...
```
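
A usage sketch, assuming both files sit in the project root; with both `build` and `image` set, compose builds from the Dockerfile and tags the result my-alpine:latest:

```
docker-compose build   # builds . and tags the image my-alpine:latest
docker-compose up      # starts the 'man' service from that image
```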