Monday, December 31, 2018

Pandas: groupby and find the most frequent item

Say I have this dataframe:

order_id | class
1 | furniture
2 | book
2 | furniture
2 | book
3 | auto
3 | auto
3 | electronics
3 | pet

and to get the most frequent class for each order:

df.groupby('order_id')['class'].agg(lambda x: x.value_counts().index[0])
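
As a sanity check, here's a minimal runnable sketch that builds the sample dataframe inline (ties are broken by whichever value value_counts() lists first):

import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 2, 2, 3, 3, 3, 3],
    'class': ['furniture', 'book', 'furniture', 'book',
              'auto', 'auto', 'electronics', 'pet']})

# most frequent class per order
print(df.groupby('order_id')['class'].agg(lambda x: x.value_counts().index[0]))
# order_id
# 1    furniture
# 2         book
# 3         auto
# Name: class, dtype: object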

Wednesday, November 14, 2018

Make a simple heatmap in R with ggplot2

So today I got a file that looks like this: a 'Name' column plus one numeric column per measured variable.
And here's the end result: a heat map with one tile per Name/variable pair, shaded from white to dark blue by value. (Notice that the order of Name in the chart is not the same as in the dataframe.)

Here's the code I used to make this plot:

rm(list=ls())
library(ggplot2)
library(reshape2)  # provides melt()

df = read.csv('fakedata.csv')

# reshape the dataframe from wide to long
df.m = melt(df, id.vars = 'Name')


ggplot(df.m, aes(variable, Name)) +
  geom_tile(aes(fill = value),
            colour = "white") +
  scale_fill_gradient(low = 'white',
                      high = 'blue4')




This is what df.m looks like after melting: three columns (Name, variable, value), with one row per Name/variable pair.

Tuesday, October 30, 2018

convert a sqlite3 query result table into a pandas dataframe

The sqlite3 query returns a list of tuples. Here's a good way to convert the list of tuples into a pandas data frame.

import pandas as pd

def q(dbc):
    # dbc is a sqlite3 cursor
    sql_c = " some query commands here"
    df = pd.DataFrame(dbc.execute(sql_c).fetchall())

    # get all column names from the cursor's description
    cns = [d[0] for d in dbc.description]
    df.columns = cns

    return df
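
Alternatively, pandas can do the whole round trip in one call with read_sql_query; a minimal sketch, where the database file and query are hypothetical:

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')  # hypothetical database file
df = pd.read_sql_query('SELECT * FROM orders', conn)  # hypothetical query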

Monday, September 17, 2018

5 available pane layouts when using tmuxp

 
tmuxp pane layout options (a sample session file follows the list):

1. even-horizontal
2. even-vertical
3. main-horizontal
4. main-vertical
5. tiled
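
Since a layout is picked per window in the tmuxp session file, here's a minimal YAML sketch showing where the option goes (the session name, window name, and pane commands are made up; load it with 'tmuxp load session.yaml'):

session_name: demo
windows:
  - window_name: work
    layout: main-vertical
    panes:
      - vim
      - htop
      - bash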

Thursday, July 12, 2018

Add multiple columns to Pandas dataframe by applying a function that returns multiple values

df[['sum', 'difference']] = df.apply(
    lambda row: pd.Series(myfunc(row['a'], row['b'])), axis=1)
# The pd.Series is the key!
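
To see it end to end, here's a runnable sketch with a made-up myfunc that returns two values:

import pandas as pd

def myfunc(a, b):
    # hypothetical helper returning two values
    return a + b, a - b

df = pd.DataFrame({'a': [3, 5], 'b': [1, 2]})
df[['sum', 'difference']] = df.apply(
    lambda row: pd.Series(myfunc(row['a'], row['b'])), axis=1)
print(df)
#    a  b  sum  difference
# 0  3  1    4           2
# 1  5  2    7           3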

Tuesday, May 8, 2018

Python read files line by line

# iterate over the file object directly: it yields one line at a time,
# and the with-block closes the file automatically
with open('inputFileName', 'r') as fo:
    for ln in fo:
        print(ln)

Tuesday, May 1, 2018

To remove parentheses and everything inside them:

AAV (South Africa) Volkswagen
AHT (South Africa) Toyota
AFA (South Africa) Ford
CL9 (Tunisia) Wallyscar

sed -e 's/([^()]*)//g'


AAV     Volkswagen
AHT     Toyota
AFA     Ford
CL9     Wallyscar
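
The same substitution works in Python with re.sub, using an identical pattern:

import re

line = 'AAV (South Africa) Volkswagen'
print(re.sub(r'\([^()]*\)', '', line))
# AAV  Volkswagen   (note the double space left behind)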

Monday, April 30, 2018

my .vimrc file

set nocompatible
filetype off
set rtp+=~/.vim/vundle.git/
call vundle#rc()
" Bundles:
Bundle 'gmarik/vundle'
Bundle 'jiangmiao/auto-pairs'
Bundle 'Lokaltog/vim-powerline'
Bundle 'scrooloose/syntastic'
Bundle 'scrooloose/nerdcommenter'
Bundle 'scrooloose/nerdtree'
Bundle 'tmhedberg/SimpylFold'
set number
set relativenumber
syntax on
set background=light
hi clear
syntax enable
colorscheme monokai
set shiftwidth=4
set tabstop=4
set expandtab
set laststatus=2

" searching
set hlsearch
set incsearch
set ignorecase
set smartcase
set showmatch

Friday, April 27, 2018

Python | Process text file and get word counts


import codecs
import pandas as pd
import jieba
import jieba.analyse

tag_list = list()  # list of all extracted tags

with codecs.open('soup.csv', 'r', 'utf-8') as f:
    for ln in f:
        # the first tab-separated field holds the text
        item = ln.strip("\n\r").split("\t")
        # extract keyword tags from the text with jieba
        tags = jieba.analyse.extract_tags(item[0])
        for t in tags:
            tag_list.append(t)

# count how often each tag appears, most frequent first
tagS = pd.Series(tag_list)
output = tagS.value_counts(ascending=False)
output.to_csv('output.csv', encoding='gbk')

Thursday, April 5, 2018

State names and abbreviations

states = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}
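
Since it's a plain dict, the reverse lookup (full name to abbreviation) is a one-liner:

# invert the mapping: full state name -> abbreviation
abbrevs = {name: ab for ab, name in states.items()}
print(abbrevs['California'])  # CA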

Wednesday, March 21, 2018

AWK: add suffix to one column and print out all

Raw data look like:

1 | Field1 | Field2 | Field3
2 | Field1 | Field2 | Field3
3 | Field1 | Field2 | Field3
4 | Field1 | Field2 | Field3
5 | Field1 | Field2 | Field3
6 | Field1 | Field2 | Field3

Want to change it to:

1 | Field1 | Field2$ | Field3
2 | Field1 | Field2$ | Field3
3 | Field1 | Field2$ | Field3
4 | Field1 | Field2$ | Field3
5 | Field1 | Field2$ | Field3
6 | Field1 | Field2$ | Field3

All I need to do is (setting OFS so the output keeps the "|" delimiter):

awk -F"|" 'BEGIN{OFS="|"} {$3=$3"$"; print}' input.file > outcome.file

Saturday, March 3, 2018

This is How You Download Your FitBit Second-Level Data without Coding

If you are a Fitbit user who wants to save a copy of your Fitbit data on your computer but doesn't have advanced programming skills, this tutorial is for you! You don't need to do any coding at all to save your second-level data. I had been struggling to get all the so-called 'intraday data' for quite a while. I found many useful resources online, for example, Paul's tutorial, Collin Chaffin's Powershell module, and the Fitbit-Python API, but they are all somewhat complicated, and I just could not get any of them to work smoothly. Recently I finally figured out a way to download these Fitbit data without any coding. Are you ready?

Step 1: Register an app on http://dev.fitbit.com

First, you need to register an account on dev.fitbit.com, then click 'MANAGE YOUR APPS' in the top right area and click the 'Register a new app' button to get started. This step is simple; just make sure the OAuth 2.0 Application Type is set to 'Personal', and that the Callback URL is complete, including the 'http://' part and a trailing '/'. Here I used 'http://google.com/' as an example. I have also used this blog's URL, 'http://shishu.info/', which worked just fine.



After you click the red ‘Register’ button, you will be able to see the credentials for the app you just registered. The ‘OAuth 2.0 Client ID’ and ‘Client Secret’ will be used in the next step.



Step 2: Use the OAuth 2.0 tutorial page

Next, you need to right click the ‘OAuth 2.0 tutorial page’ link and open it in a new tab, so that you can look back at your app’s credentials easily. Make sure the ‘Implicit Grant Flow’ is chosen instead of ‘Authorization Code Flow’ – this will make things much easier! After you copy/paste the ‘Client ID’ and ‘Client Secret’ into the blanks and put in the Redirect URI, click the auto-generated link.



Then you will see the confirmation page. Just click ‘Allow’.



And you will be led to the ‘Redirect URI’, which is ‘http://google.com/’ in this case.  But the address bar now shows a very long string which is the token for your app.



Next, copy everything in the address bar except the leading part ( https://www.google.com/ ) into the 'Parse response' section, and hit the Enter key once. This way you can clearly see what the token is, what the scope is, and how long the token is valid. In this case, the token is valid for one week, which equals 604800 seconds. Pretty good, right?



Step 3: Make the request and get the data!

After you are done with the ‘Parse response’, the next step is ready for you automatically.



Just click the 'Send to Hurl.it' link and then 'Launch Request' on the new page. Make sure to tell the site that you are not a robot, too.



After that, if you see the '200 OK' status, everything worked fine, and you can find the data you want in the lower half of the page. Like this:



If you click 'view raw' on the right side, the 'BODY' changes to raw text, which you can simply copy/paste into a text editor. And that's all it takes to download your Fitbit second-level data! As promised, you don't need any programming skills to accomplish this. Isn't that cool?

Additional tips:

tip 1: other data types

If you want some other data rather than the ‘user profile’, you can simply change the ‘API endpoint URL’ in the ‘Make Request’ step. According to the Fitbit API documentation, you can get the heart rate data by using this URL:

https://api.fitbit.com/1/user/-/activities/heart/date/today/1d.json

Or the sleep data by using:

https://api.fitbit.com/1/user/28H22H/sleep/date/2014-09-01.json

tip 2: save json file in an easier way

If you think the 'Send to Hurl.it' method is not fast enough, you can copy the 'curl' command auto-generated in the 'Make Request' step and run it in a terminal (on Windows, the 'cmd' window). Add the following to the end of the copied 'curl' command so that the data is saved to disk:

>> data_file.json

tip 3: save multiple days’ data automatically

I think Fitbit provides the functionality to let users download multiple days' data with one command, by specifying the date range in the curl request. However, I could not make it work for some unknown reason, so I used a piece of Python code to download multiple days' data instead. Here's the code, which I think is pretty self-explanatory.

import requests
import json
import pandas as pd
from time import sleep
 
# put the token for your app in between the single quotes
token = ''
 
# make a list of dates 
# ref: http://stackoverflow.com/questions/993358/creating-a-range-of-dates-in-python
# You can change the start and end date as you want
# Just make sure to use the yyyy-mm-dd format
start_date = '2015-12-28'
end_date = '2016-06-14'
datelist = pd.date_range(start = pd.to_datetime(start_date),
                         end = pd.to_datetime(end_date)).tolist()
 
'''
The code below uses a for loop to generate one URL for each day in datelist,
then requests each day's data and saves it into an individual json file.
Because Fitbit limits requests to 150 per hour, the code sleeps for 30
seconds between requests to stay under that limit.
'''
for ts in datelist:
    date = ts.strftime('%Y-%m-%d')
    url = 'https://api.fitbit.com/1/user/-/activities/heart/date/' + date + '/1d/1sec/time/00:00/23:59.json'
    filename = 'HR'+ date +'.json'
    response = requests.get(url=url, headers={'Authorization':'Bearer ' + token})
 
    if response.ok:
        # response.text is already a JSON string, so write it out as-is;
        # json.dump() here would double-encode it
        with open(filename, 'w') as f:
            f.write(response.text)
        print (date + ' is saved!')
        sleep(30)
    else:
        print ('The file of %s is not saved due to error!' % date)
        sleep(30)





Monday, February 26, 2018

Find the speed limiting part of your Python code

Say you have a Python script, 'simple.py', like this:

import time

def func_A():
    time.sleep(1)

def func_B():
    time.sleep(10)

if __name__ == '__main__':
    func_A()
    func_B()

and you want to figure out which part takes the most time to finish. What you can do is:

python -m cProfile simple.py

However, the output is kind of too complicated to understand. A better way is to use 'cprofilev' (https://github.com/ymichael/cprofilev), which you can install with pip:

python -m cprofilev simple.py

The output would look like:

[cProfileV]: cProfile output available at http://127.0.0.1:4000

Just go to http://127.0.0.1:4000 and all the information is right there waiting for you.

Tuesday, February 6, 2018

Round to thousands | AWK

# round var1 to the nearest thousand and return it in units of thousands
# (e.g. 1499 -> 1, 1500 -> 2); save as e.g. rounding.awk and run:
#   awk -f rounding.awk input.file
function rounding(var1){
    a = var1 + 500
    b = int(a / 1000)
    return b
}

BEGIN{
    FS = "|"
    OFS = "|"
}
{
    print $1, $2, $3, rounding($4), rounding($5), rounding($6), rounding($7), rounding($8), rounding($9), rounding($10), rounding($11), rounding($12)
}

Monday, January 29, 2018

Restart Postgres Service

Go to:

C:\Program Files\PostgreSQL\10\bin

Run:

pg_ctl restart -D "C:\Program Files\PostgreSQL\10\data"

Thursday, January 11, 2018

A self-defined algorithm to group strings | Python

import Levenshtein  # pip install python-Levenshtein
from itertools import compress

def group_names(x):
    d1 = dict()
    for wx in x:
        # distance from wx to every name still in the pool
        dist_list = [Levenshtein.distance(wx, w2) for w2 in x]
        indx = [d <= 4 for d in dist_list]
        # names within distance 4 of wx form one group
        sub_lst = list(compress(x, indx))
        # drop the grouped names from the pool
        x = [e for e in x if e not in sub_lst]
        print(len(x))
        if len(sub_lst) > 1:
            # map each member of the group to its first element
            for i in sub_lst[1:]:
                d1[i] = sub_lst[0]
    return d1
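
A quick usage sketch (the names and the python-Levenshtein package are assumptions for illustration):

names = ['Jonathan Smith', 'Jonathon Smith', 'Mary Jones']
print(group_names(names))
# prints the shrinking pool size each pass, then:
# {'Jonathon Smith': 'Jonathan Smith'}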


The problem is that when the input list (x) is too long, it takes quite a while to finish.

Wednesday, January 10, 2018

Replace multiple characters in a string | Tableau

Just use nested REPLACE() functions (the name is lowercased first, so the patterns must be lowercase):

replace(replace(replace(replace(replace(lower([Employer Name]), "inc",""), "-",""), "&",""), " co","")," llc","")
