MATLAB ... and more ...: 2015

Thursday, November 5, 2015

Creating NaN vector in MatLab

I know that these two functions: zeros() and ones() can be used to create a series of '0's and '1's in MatLab. For example:

a = zeros(5,1)
b = ones(7)

I just learned about another function today: NaN(). This function can create a vector or matrix with all the elements equals to nan. This is useful if you want to add a new empty column to a dataset, or table.

c = NaN(4,5)

Thursday, October 1, 2015

Aggregate R dataframe with multiple conditions and multiple outputs

t_ag = aggregate(cbind(CPC, PM25)~SiteID+Date+Session+stType2,
data = t,
FUN = mean)

Thursday, September 24, 2015

Data Wrangling with MongoDB Lesson 6.11 Improving Street Names

"""
Your task in this exercise has two steps:

- audit the OSMFILE and change the variable 'mapping' to reflect the changes needed to fix
the unexpected street types to the appropriate ones in the expected list.
You have to add mappings only for the actual problems you find in this OSMFILE,
not a generalized solution, since that may and will depend on the particular area you are auditing.
- write the update_name function, to actually fix the street name.
The function takes a string with street name as an argument and should return the fixed name
We have provided a simple test so that you see what exactly is expected
"""
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint

OSMFILE = "example.osm"
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
"Trail", "Parkway", "Commons"]

# UPDATE THIS VARIABLE
mapping = { "St": "Street",
"St.": "Street",
"Rd.": "Road",
"Ave": "Avenue",
"N." : "North"}

def audit_street_type(street_types, street_name):
m = street_type_re.search(street_name)
if m:
street_type = m.group()
if street_type not in expected:
street_types[street_type].add(street_name)

def is_street_name(elem):
#print elem.attrib['k']
#print elem.attrib['v']
return (elem.attrib['k'] == "addr:street")

def audit(osmfile):
osm_file = open(osmfile, "r")
street_types = defaultdict(set)
print street_types
for event, elem in ET.iterparse(osm_file, events=("start",)):

if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if is_street_name(tag):
audit_street_type(street_types, tag.attrib['v'])

return street_types

def update_name(name, mapping):
# YOUR CODE HERE
betterNameParts= []

name = name.split()
for index, part in enumerate(name):
if part in mapping.keys():
betterNameParts.append(mapping[part])
else:
betterNameParts.append(part)

betterName = ' '.join(betterNameParts)

return betterName

def test():
st_types = audit(OSMFILE)
assert len(st_types) == 3
pprint.pprint(dict(st_types))

for st_type, ways in st_types.iteritems():
for name in ways:
better_name = update_name(name, mapping)
print name, "=>", better_name
if name == "West Lexington St.":
assert better_name == "West Lexington Street"
if name == "Baldwin Rd.":
assert better_name == "Baldwin Road"

if __name__ == '__main__':
test()

Wednesday, September 23, 2015

numpy.ndarray doesn't have count attribute

First convert it into a list, by using list() function. Then use the count method to count.

list(nda).count(1)

Tuesday, September 15, 2015

Python code to process CPC 3007 data

# this code process the one minutes cpc data and convert it to each seconds

import xlrd, xlwt, datetime
from array import array

fileName1 = '06082015.xlsx'
f1WorkBook = xlrd.open_workbook(fileName1)

#check sheet names
sheetName = f1WorkBook.sheet_names()
# select the first sheet
f1Sheet1 = f1WorkBook.sheet_by_index(0)
sn = 19
en = 147
#copy and paste the header lines, row 0 to 17
header = [f1Sheet1.row(r) for r in range(18)]
#header = [f1Sheet1.cell_value(row, col) for col in range(2) for row in range(18)]
print header[0]

### ----------This section deals with header---------------------------
newWorkBook = xlwt.Workbook()
newsheet = newWorkBook.add_sheet('processed_'+fileName1[0:8])

for rowIndex, rowValue in enumerate(header):
for colIndex, cellValue in enumerate(rowValue):
newsheet.write(rowIndex, colIndex, cellValue.value)

###-------This part deals with the time column---------------------------------
time = [f1Sheet1.cell_value(row, col) for col in range(2) for row in range(sn-1,en)]
newTime = [0]*((en-sn)*60) # create an empty vector
newTime[0] = time[0]
print newTime[0]

for i in range(0, en-sn):
for j in range(0,60):
#if j+60*i <70: font="">
# print j+60*i
newTime[j+60*i] = newTime[0]+(j+60*i)/(1440*60.0)

for rowIndex in range(18, 18+(en-sn)*60):
#print rowIndex
newsheet.write(rowIndex, 0, newTime[rowIndex-18])

###---------This section deals with the concentrations -----------------------

conc = [f1Sheet1.cell_value(row,col) for col in range(1,2) for row in range(sn-1, en)]
newConc = [0]*((en-sn)*60)
for i in range(0, en-sn):
for j in range(0,60):
newConc[j+60*i] = conc[i]

for rowIndex in range(18, 18+(en-sn)*60):
newsheet.write(rowIndex, 1, newConc[rowIndex-18])

### ------------------Save file
newWorkBook.save('processed_'+fileName1[0:8]+'.xls')

Tuesday, June 9, 2015

A simple example of mixed effects model using simulated data in R

First let's make the dataframe.

rm(list=ls())

library(lme4)

set.seed(1)

ID = c(rep('A',10), rep('B', 10), rep('c', 10))

x = c(1:10)

ya = x + 1 + rnorm(10)

yb = x + 2 + rnorm(10)

yc = x + 3 + rnorm(10)

df <- data.frame="" id="" x="rep(x,3)," y="c(ya,yb,yc))</font">

Then let's do liner regression.

mdl.lm <- span=""> lm(y~x, data=df)

coef(mdl.lm)

## (Intercept)           x 

##   
2.073487    1.001631

The linear regression cannot treat each ID seperately. It provides an overall intercept (2.073487) and overall slope (1.001631) for all the 30 data points.

Next, let's build a random-intercept mixed effects model.

mdl.mix.RI <- span=""> lmer(y~x+(1+1|ID),data=df, REML=F)

coef(mdl.mix.RI)

## $ID

##  
(Intercept)        x

## A   
1.287210 1.001631

## B   
2.211162 1.001631

## c   
2.722090 1.001631

## 

## attr(,"class")

## [1] "coef.mer"

The above model can treat each ID differently, so the Intercept for each ID are different.

Last, let's build a random-intercept and random-slope mixed effects model.

mdl.mix.RIS <- span=""> lmer(y~x+(1+x|ID),data=df, REML=F)

coef(mdl.mix.RIS)

## $ID

##  
(Intercept)         x

## A  
0.9373545 1.0650834

## B  
2.2293276 0.9929275

## c  
3.0537802 0.9468823

## 

## attr(,"class")

## [1] "coef.mer"

This model also treat each ID differently, and it gives different intercept and slope for each ID.

Thursday, May 21, 2015

r apply function to list

http://stackoverflow.com/questions/30339652/r-apply-correlation-function-to-a-list/30340905#30340905

Tuesday, May 19, 2015

annotate p value in italicized font in ggplot

label = substitute(italic(p) == a, list(a=0.371) )
ggplot() + annotate('text', label=as.character(as.expression(label)), x=10, y=10, parse=T)

Wednesday, April 29, 2015

Relationship between Pearson correlation coefficient and the r-squared from linear regression

Simply put, if you take the square root of the r-squared value, which can be easily obtained in Excel, the outcome equals to the Pearson correlation coefficient.

Here's my R code to test this:

x<-rnorm c="" font="" nbsp="">
y<-rnorm c="" font="">
mdl <- font="" lm="" x="" y=""># linear regression model
r = summary(mdl)$r.squared # the r-squared value from linear regression
rp = cor.test(x,y, method='pearson')$estimate # rp is the Pearson correlation coefficient
sqrt(r) == rp

Wednesday, February 11, 2015

China PM2.5 & PM10 historical data repository

http://pan.baidu.com/s/1dD8KdpB

More data will be added to this repository in the future.

MATLAB ... and more ...