Tuesday, August 8, 2017


Tasks finished:

-- Re-ran the Random Forest model from Joel's Proxy model (folder 09):

```
cat featuresall.pip | python.exe ./scorerfmodel.py > tagscrall-RF.csv
```

This won't work in practice because single-threaded scoring would take ~500 hours to finish the job.
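
For context, a minimal sketch of what a per-record scorer like this might look like. The model file name, the tab-separated input format, and the use of `predict_proba` are all assumptions for illustration; the real `scorerfmodel.py` isn't shown here:

```
import sys
import pickle

import numpy as np

with open("rfmodel.pkl", "rb") as f:  # hypothetical model file name
    model = pickle.load(f)

for line in sys.stdin:
    # Placeholder parser: one record per line, tab-separated,
    # an id followed by numeric features.
    fields = line.rstrip("\n").split("\t")
    rec_id = fields[0]
    feats = np.array(fields[1:], dtype=float).reshape(1, -1)
    # Scoring one row per predict call pays the full per-call overhead
    # of the forest on every record -- this is the ~500-hour bottleneck.
    score = model.predict_proba(feats)[0, 1]
    print("%s,%.6f" % (rec_id, score))
```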

-- Parallelized the Random Forest model by:

1. setting model.set_params(n_jobs=30)
2. parsing all the features first, then feeding them to the model in a single predict call (see the sketch after this list)
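
A minimal sketch of this batch approach, assuming a pickled binary scikit-learn RandomForestClassifier and the same hypothetical tab-separated input as above:

```
import sys
import pickle

import numpy as np

with open("rfmodel.pkl", "rb") as f:  # hypothetical model file name
    model = pickle.load(f)
model.set_params(n_jobs=30)  # score with 30 parallel jobs

def parse_record(line):
    # Placeholder parser, same assumed format as above.
    fields = line.rstrip("\n").split("\t")
    return fields[0], [float(x) for x in fields[1:]]

ids, rows = [], []
for line in sys.stdin:
    rec_id, feats = parse_record(line)
    ids.append(rec_id)
    rows.append(feats)

# One big multi-threaded predict call: fast, but the entire feature
# matrix has to sit in RAM, which is what causes the swapping below.
scores = model.predict_proba(np.array(rows))[:, 1]
for rec_id, s in zip(ids, scores):
    print("%s,%.6f" % (rec_id, s))
```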

This is fast -- but holding all the records at once took a huge amount of RAM, and once physical memory is full and the HDD swap kicks in, the speed drops again.

So the final solution is to cut the input features into about 16 chunks of 1 million records each (the last chunk has only about half a million) and process each chunk separately. This way the prediction is still multi-threaded while RAM usage stays low. A sketch of this chunked version follows.
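
A sketch of the chunked version under the same assumptions as above; the chunk size matches the ~1M records per chunk from the notes, and everything else (file names, parser) is a placeholder:

```
import sys
import pickle
from itertools import islice

import numpy as np

CHUNK_SIZE = 1000000  # ~1M records per chunk, as in the notes

with open("rfmodel.pkl", "rb") as f:  # hypothetical model file name
    model = pickle.load(f)
model.set_params(n_jobs=30)  # keep the parallel scoring from before

def parse_record(line):
    # Placeholder parser, same assumed format as above.
    fields = line.rstrip("\n").split("\t")
    return fields[0], [float(x) for x in fields[1:]]

while True:
    lines = list(islice(sys.stdin, CHUNK_SIZE))
    if not lines:
        break
    parsed = [parse_record(l) for l in lines]
    ids = [rec_id for rec_id, _ in parsed]
    X = np.array([feats for _, feats in parsed])
    # Each chunk still gets a multi-threaded predict call, but peak RAM
    # is bounded by one chunk (~1M rows) instead of the whole file.
    scores = model.predict_proba(X)[:, 1]
    for rec_id, s in zip(ids, scores):
        print("%s,%.6f" % (rec_id, s))
```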


 

