Wednesday, September 20, 2017

python - read large csv into pandas by chunks

import pandas as pd

# read the file in chunks to keep memory bounded
chunks = pd.read_table('filename', chunksize=500000)
# optionally filter each chunk first (e.g. chunk[chunk['id'] == 1]; column name illustrative),
# then concatenate the pieces into one DataFrame
df = pd.concat(chunk for chunk in chunks)

remove duplicates

To remove duplicated rows:


awk '!seen[$0]++' <filename>

To remove rows with a duplicated field (say $1 is the ID and the entire row should be dropped when the ID repeats):

awk '!seen[$1]++' <filename>
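The same trick can be sketched in Python, keeping the first occurrence of each line (or of each key field, using whitespace-separated fields like awk's default). The function name is just for illustration:

```python
def dedupe_lines(lines, key=None):
    """Keep the first occurrence of each line (or of each key field)."""
    seen = set()
    out = []
    for line in lines:
        k = line if key is None else line.split()[key]
        if k not in seen:
            seen.add(k)
            out.append(line)
    return out

# whole-line dedup, like awk '!seen[$0]++'
print(dedupe_lines(["a b", "c d", "a b"]))         # ['a b', 'c d']
# dedup on the first field, like awk '!seen[$1]++'
print(dedupe_lines(["1 x", "2 y", "1 z"], key=0))  # ['1 x', '2 y']
```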

Tuesday, September 19, 2017

filter a file based on tokens in another file

BEGIN{
  FS="|"
  OFS="|"

  while ((getline < "Token_list_file.csv") > 0) {
    id[$1] = $1;
  }
}

{
  appid = $1;
  if (appid in id) { print $0; }
}
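The awk program above can be sketched in Python as well. The pipe delimiter follows the awk script; the function and variable names are placeholders:

```python
# filter data rows whose first field appears in a token list,
# mirroring the awk script above (pipe-delimited records)
def filter_by_tokens(token_lines, data_lines, sep="|"):
    # collect the IDs from the first column of the token file
    ids = {line.split(sep)[0] for line in token_lines}
    # keep only data rows whose first column is a known ID
    return [line for line in data_lines if line.split(sep)[0] in ids]

tokens = ["42|foo", "7|bar"]
data = ["42|keep|x", "9|drop|y", "7|keep|z"]
print(filter_by_tokens(tokens, data))  # ['42|keep|x', '7|keep|z']
```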

my-alpine and docker-compose.yml

```
version: '1'
services:
  man:
    build: .
    image: my-alpine:latest
```

Dockerfile:

```
FROM alpine:latest
ENV PYTH...
```