Command-Line Worldview - John Dennison

RSS

Posts tagged with "R"

Simple Parallel randomForest with doMC package

I have been exploring how to speed up some of my R scripts and have started reading about some amazing corners of R. My first weapon was the Rcpp and RcppArmadillo package. These are wonderful tools and even for someone that has never written c++ before, there are enough to examples and documentation to get started. I hope to share some of these approaches soon. The second and the topic of this missive is parallelization through the package doMC.

When answering a question on stats.stackoverflow, I got to share a simple use case with the randomForest library. I wanted to share that here as well. First a little background on why this works. randomForest is an ensemble technique that trains a group of decision trees on random subset of a training set( a group of trees, forest get it). A new record is presented to this group and what ever class the preponderance of underlying decision trees choose is the classification. For this example, one of the most important aspects of this algorithm is that each decision tree is independently trained on a random subset of variables and records. When in the parallel mindset independent is next godliness, what it means is that you can spread the training of a forest over multiple cores/machines. Below is a simple one machine example.

library("doMC")
library("randomForest")
data(iris)
 
registerDoMC(4) #number of cores on the machine
darkAndScaryForest <- foreach(y=seq(10), .combine=combine ) %dopar% {
   set.seed(y) # not really needed
   rf <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
}

I passed the randomForest combine function(which will stitch independently trained forests together)  to the similarly named .combine parameter( which controls the function on the output of the loop. The down side is you get no OOB error rate or more tragically variable importance. But this will speed up your training. 

Re-approaching problems with parallelization in mind is a tough problem but i have seen some very real improvements in my speed by doing so. 

R/Python Web Apps

I have a little delinquent on this whole blogging thing but here is a talk I gave to the DC R Group. On a twisted and Rpy2 web application framework that I built for my company. Enjoy

http://bit.ly/NW0Neg

J

Jan 4

Long running R commands: unix screen, nohup and R

I wanted to write a quick post about a useful linux tool for using R. I sometimes have long running R sessions for model training. One solution that I have alluded to is to call your script with Rscript.

The nohup command in linux with push this to the back ground. Even adding mutt(or other email program) to the bottom of your script is extremely useful.

You can save the details of the training into output.txt and then place this command in your R file

system(mutt -s “Master, this is an email from HAL 9000, your commands have been completed” john@foo.com < ~/R/output.txt)

Then you can call the script with nohup Rscript foo.R &

The nohup command runs a command in the background and the ampersand is unix for return the prompt. This will redirect all standard output to a file nohup.out file placed in the directory you called it. However, if you have multiple R file running and one breaks the error gets lost in the common log. Total mess, trust me. The solution is to redirect the output log to individual files with this little thing:

nohup Rscript foo.R > ~/R/logs/foo.out 2>&1 &

This redirects the standard out to a file ~/R/logs/foo.out as well as place all standard error there as well. that is what: 2>&1 is doing.

This is all well and good but we know that most R is written interactively not in nifty self contained scripts. To do this linux has a great program call screen. I and not going to get too fancy with screen but a it has been extremely useful to me. Again i am sshing to remote reserved servers or EC2 instances but this is also useful on a home machine for kicking of an R command and going back to what ever you do on you machine. (ANSII porn or http://xkcd.com/981/). The command is:

screen -S R

This will open a seemingly identical terminal prompt. From here use R(type R…) as you would. When you hit a long running process. Hit Ctrl-a d. This is the screen command to access all screen commands. Ctrl-a d will bring you back to your home prompt. Ctrl-a then typing :quit will exit the screen. When you want to get back, you can type:

screen

Now the -S R options you typed before are optional and if you only have one screen open. I have many screens running (mainly interactive sessions to my databases or ssh to EC2 machines) and getting back to them is a pain unless you name the screens the -S FOO will open a screen called FOO. To reopen type:

screen -Dr FOO

or in our case:

screen -Dr R

The D arg will detach the screen if you exited in a less then graceful manner.

One more trick: if you paste the R train command and the above system command email system() as one text block in the R screen session, then detach you will receive a lovely email when you are done.

Good Luck hacking this the New Year!!!

John

Lehmann Primiality Test in R

After a small frustration with the limitations of R 16-bit intergers. I think i have worked out a solution. The issue was i was implementing a bone-headed solution to the modulo/exponent step. So to recap. The Lehmann primality test is a probabilistic prime test. I got the implementation from this article http://davidkendal.net/articles/2011/12/lehmann-primality-test

the basic logic is to take n

1. get a random number a = between 1 and n-1

2. calculate x =(a^(n-1)/2) %% n

3. if x == 0 or x == (-1 %% n) then the number could be prime

The issue happened when calculating x =(a^(n-1)/2) %% n. For even n < 100 the number in this function gets big. So big that the limitation on R integer lengths rounds the number and then the modulo function is meaningless. The way to fix this is to use modular exponentiation. (Thanks to stackoverflow  for bringing up the solution). This allows us to iteratively multiply the  base by the exponent and store the modulo for every step. By doing this we never have to deal with a number so large as to screw with the R limitations.There are some clever highly efficient modular exponentiation algorithms but i am no computer scientist so we will use the simplest(modular exponentiation with modular multiplication)

Here is the full solution:

modexp<-function(a, b, n){
    r = 1
    for (i in 1:b){
        r = (r*a) %% n
    }
    return(r)
}
 
 
primeTest <- function(n, iter){
   a <- sample(1:(n-1), 1)
    lehmannTest <- function(y, tries){
      x <- modexp(y, (n-1)/2, n)   
    if (tries == 0) {
      return(TRUE)
            }else{          
      if ((x == 1) | (x == (-1 %% n))){
        lehmannTest(sample(1:(n-1), 1), (tries-1))
        }else{
        return(FALSE)
      }
    }
  }
   if( n < 2 ){
     return(FALSE)
     }else if (n ==2) {
       return(TRUE)
       } else{
         lehmannTest(a, iter)
         }
}

This type of iterative exponentiation is implemented in Python under pow() so a python implementation is quite easy to build as well. Enjoy. R can get frustrating but a little searching around can get you a far way.

Cheers

J

I may have ventured into the first circle and Virgil Is NOT Here — Lehmann Primality Test in R

So a post about Lehmann Primality Tests in HackerNews came across my twitter this morning. Seemed like a great quick coding exercise during lunch. Little Did I know…It is wonderfully simple primality test but i think i have hit up against some R quirks. Here is my code for the test (thanks to David Kenal at the aforementioned link of the implementation in Scheme, this is very much his logic):

the basic logic is to take n

1. get a random number a = between 1 and n-1

2. calculate x =(a^(n-1)/2) %% n

3. if x == 0 or x == (-1 %% n) then the number could be prime

The more you repeat these three steps the more certain you are. So you call it recursively. So the function takes to arguments the number is question and the number of iterations(higher the more certain). This implementation seemed to work.

However… I started testing larger primes and things go terribly wrong. you get incorrect an answers and an ominous sounding message:

Warning message:
probable complete loss of accuracy in modulus 

I believe the issue comes down to floating point conversions. AKA the first circle of the R inferno. (which is a great read and for a dante fan and a nerd a wonderful collisions of worlds)

say for n<-97

> (y^((n-1)/2))
[1] 2.730792e+64

That is a big Number thanking the mod of it doesn’t go so well. above a certain point the mod function breaks down because of floating point rounding. I don’t know if there is an easy solution but i wanted to share my experiences as and if you are “Midway on our life’s journey, I found myself in dark woods, the right road lost” always look to R inferno for help.

Here is a loop to test:

prime_test<-c(2,3,5,7,11,13,17 ,19,23,29,31,37)

for (i in 1:length(prime_test)) {
  print(primeTest(prime_test[i], 50))
}

Poor, Poor Hillary

This will be the last baby name related post but this came out of part two web scrapping post last month. I was looking for the fastest rising Names. I flip the logic and looked for the fastest declining names in relative popularity.

Out of that exercise this gem fell out.

image

Here is the messy ggplot code:

nm<-"Hillary"
p<-ggplot(names[which(names$Female %in% nm),],aes(x=Year,y=Rank)) 
p<- p + ylim(max(names$Rank),min(names$Rank)) 
p<- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p<- p + opts(title = "Poor, Poor Hillary: Popularity of the Female Name 'Hillary' \n But we already that the Clinton Prseidency was better to certain women")
matrix.label<-data.frame(Year = 1996 , Rank = 100, Text = "Clinton \n Presidency") # create the custom on graphic text label, kind of a hack
p <- p +  geom_rect(aes(xmin = 1992 , xmax = 2001 , ymin = 1000 , ymax = 1 ),fill = "Blue", alpha = .002) 
p <- p +  geom_rect(aes(xmin = 2007 , xmax = 2008 , ymin = 1000 , ymax = 1 ),fill = "Green", alpha = .002) 
matrix.label2<-data.frame(Year = 2007 , Rank = 100, Text = "Hillary's \n Presidential \n Run")
 
# 1993 ->2001 the years of the Clinton Presdidency but 1992 was when he won the election so that is our starting point
p <- p + geom_text(data = matrix.label, aes(label = Text, size =3))
p <- p + geom_text(data = matrix.label2, aes(label = Text, size = 3))
p

Created by Pretty R at inside-R.org

For the Data look at my github project https://github.com/jofusa/ssa-baby-names

SVN Version Control, R, and some rambling thought on AWS,Rscripts

I do a alot of my modelling on Rstudio hosted on EC2 instances. If you don’t use, I would highly recommend. A brilliant tool. Kudos to the Rstudio team. I have made a personal and professional pledge to obsessively use version control. I hope to show a quick example of how to use version control in the modelling context, even if you are not tweaking the linux kernel. I know Rstudio has version control as an enhancement very soon and i am eagerly awaiting its release, like a virgin on prom night. (perhaps given the nerdy crowd that could bring up painful memories).

The first and obvious use of version control is to keep track of scripts over time. This is exactly what it is designed to do. Even if you came to R as a non-programmer(like me) it good to adopt the best parts. If you are at all familiar with version control this should make plenty of sense. If you are not here are some better high level and tutorials better then i could write:

http://chronicle.com/blogs/profhacker/a-gentle-introduction-to-version-control/23064

More technical introduction svn:

http://svnbook.red-bean.com/

Great Git intro(I professinoally use SVN and just started Git, the jump can be a little steep):

http://library.edgecase.com/git_immersion/index.html

Now to why you are reading this article, the R. So you crunch an afternoon, add some clever backflip to your R project, you should save it, then flip over to the command line and commit to you repo. This hopefully will get even easier with aforementioned Rstudio enhancement. However what about your models, the actual R objects. If you are just exploring and hacking a data set then the longevity of your models are not too important, you got the insight and you keep modeling. However, when you are constantly tweaking and comparing similar models then tracking these changes is incredibly important.

library(randomForest)
seed<-123
set.seed(seed)
 
data(iris) #Load Data
 
start<-Sys.time()
iris_rf <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
end<-Sys.time()
time<-(end-start)
 
#Save the model
save(iris_rf, file ="~/R/model/iris_rf")
 
#Now we add the the SVN Repo
system("svn add ~/R/model/iris_rf")
 
#Build Comment and Commit
svn.comment <- paste("RandomForest for Iris Data with seed of: ",seed," and run time of ", time," Secs", sep ="")
 
eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 
#Now get SVN Revision Number     
model_ver<-system("svn info ~/R/models/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE) 
#This awk command grabs the version number, 
#the 'intern =TRUE' redirects the output of the system command to your object instead of stout
#a brilliant useful tool on a linux system
print(model_ver)
#Now we have our revision number

This is clearly a toy example but it illustrates the use. Because I do a lot of database development, something is not useful to me until it stored in a database. That depends on what you are going to use them models for but I like to store out of sample error rates, cross validation results, the specific parameters used in each model run and of course the revision number of the SVN repo. That ties it all together. Now for a further example say we want to tweak the example.

args1 <- commandArgs(TRUE)[1]
library(randomForest)
seed<-args1
set.seed(seed)
 
system("svn update ~/R/model/")
load("~/R/model/iris_rf")
 
data(iris) #Load Data
 
start<-Sys.time()
iris_rf2 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
end<-Sys.time()
time<-(end-start)
 
#we are combining the old model with the new model to add trees to the forest
rf.all <- combine(iris_rf,iris_rf2)
 
#Save the model
save(rf.all, file ="~/R/model/iris_rf")
 
#Now we add the the SVN Repo
system("svn add ~/R/model/iris_rf")
 
#Build Comment and Commit,
svn.comment <- paste("RandomForest for Iris Data with seed of: ",seed," and run time of ", time," Secs", sep ="")
 
eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 
#Now get SVN Revision Number     
model_ver<-system("svn info ~/R/models/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE) 

I have added a little tweak to the top of the this script the really opens up the power of this approach.

commandArgs(TRUE)[1]

This functions parses command line arguments so they can be used within your R script. Say we saved this file as rf_Iris.R. you could call it from the command line or from within a shell script(where this REALLY gets fun, ok maybe i have a tainted idea of whats fun) like:

Rscript ./rf_Iris.R 456

Now this example of resetting the seed is a little toy, however you could pass in the target variable, model name for rerunning/updating, parametrize a sql call. It get cool i promise you. It turns R from a interactive scripting tool to a batch process. I will post some really cool shell scripts that allow you to spin up EC2 machines to remotely execute Rscripts on the cloud. Combining the randomForest Combine technique used above and a couple EC2 machines you can build a distributed training ‘cluster’. Or more accurately described by Mike Driscoll as “Bash Reduce”. At minimum you can expand your computing power from the comfort of your own command line. (as well as really useful story to pick up girls at the bar. little know fact, talk of distributed cloud based modelling really gets the women).

This framework can be used to dynamically build R statements using command-line arguments. I also like the three line split because you can just run the paste command to see what would run. It is totally kludgey the more complicated the statement it can really get difficult to debug. If someone has a more elegant solution i would love to hear.

eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 

A bit of WARNING: this type of concatenate command building should get your DBA/SysAdmin alarms going off. You should not expose any script that is this hack to web server or let i run by a potentially malicious soul on your machine. It is just asking for a sting injection attack. If you planning to access these scripts through a rApache or make them public facing, you should NOT be concatenating system commands. If you are skilled enough to do that, then you probably don’t need me to tell you.

Cheers

Ulam Spirals in R and ggplot

Having seen a twitter post speed by about Ulam Spirals I started to read up.  As the story goes in 1963 Stanislaw Ulam was bored at conference and started scribbling numbers in a spiral. What he discovered was a strange diaginal pattern of Prime Numbers. I was a little bored as well, so i decided to write up some R to plot these patterns.

Here is the main function. I am sure there are more elegant ways of calculating a Ulam spiral but here is my hack. First we calculate the diagonals and then fill in the gaps.

 
Ulam.Spiral<-function(N){
  if (even(N)){
     cat(sprintf("Error: function only accepts odd integers because it a poorly written and fragile piece of code.\n"))
  }else{
m <- matrix(NA, nrow=(N), ncol=N)
    top.left<-c(1,1)
    bottom.right<-c(N,N)
    top.right<-c(1,N)
    bottom.left<-c(N,1)
    n<-N
    a<-N
    m[median(1:N),median(1:N)]<-1
 
while(a>=3){
# This is an adaptation of a Euler Problem 28 solution. It calculates the diaginals.
    m[bottom.right[1],bottom.right[2]]<- a^2 #bottom right
    m[top.left[1],top.left[2]]<- a^2 - 2*a + 2 # top left
    m[top.right[1],top.right[2]]<- a^2 - 3*a + 3 # top right
    m[bottom.left[1],bottom.left[2]]<- a^2 - a + 1 #bottom left
 
#Both Horizontal Rows
    m[top.left[1],(top.left[2]+1):(top.right[2]-1)]<-seq( m[top.left[1],top.left[2]]-1,m[top.left[1],top.right[2]]+1) #Fill in top row
    m[bottom.left[1],(bottom.left[2]+1):(bottom.right[2]-1)]<-seq(m[bottom.left[1],bottom.left[2]]+1, m[bottom.left[1],bottom.right[2]]-1) #Fill in bottom row
#Left Vertical Rows
    m[(top.left[1]+1):(bottom.left[1]-1), top.left[1]] <- seq(m[top.left[1],top.left[1]]+1, m[bottom.left[1],top.left[1]]-1)#Left Hand Row
#Right Verical Row
   m[(top.right[1]+1):(bottom.right[2]-1),top.right[2]] <-seq(m[top.right[1],top.right[2]]-1 , m[top.right[1],top.right[2]] - (a-2) )
#drop down on square and repeat
    a<-a-2
    top.left<-c(top.left[1]+1,top.left[2]+1)
    bottom.right<-c(bottom.right[1]-1,bottom.right[2]-1)
    top.right<-c(top.right[1]+1,top.right[2]-1)
    bottom.left<-c(bottom.left[1]-1,bottom.left[2]+1)
}
return(m)
}
}

Created by Pretty R at inside-R.org

Here is the output for a 5X5 spiral

     [,1] [,2] [,3] [,4] [,5]
[1,]   17   16   15   14   13
[2,]   18    5    4    3   12
[3,]   19    6    1    2   11
[4,]   20    7    8    9   10
[5,]   21   22   23   24   25

The code is on git hub. https://github.com/jofusa/ulam-spirals-R

Many Thanks to http://librestats.wordpress.com/2011/08/20/prime-testing-function-in-r/ for a useful IsPrime() function and saving me some time.

Here is an example Plot- Prime Number are in red:

Popular Baby Names Walk-Through Part 2 - Graphing the fast movers

I will assume you have read through part 1 and have the csv file loaded. While we covered some basic graphing in the last post i hope to get into a little more of the data crunching. Specifically I am interested in the names which where driven by a specific cultural phenomena. There are clearly decade long trends where “Beth” is fading in popularity, but you can not point to a specific event or figure in the public consciousness that is driving this. Enough talk let me show you what I mean. Here is the chart for the female name Trinity. Hat Tip to the google dev python people for pointing this out.

nm<-"Trinity"
p<-ggplot(names,aes(x=Year,y=Rank)) 
p<- p + ylim(max(names$Rank),min(names$Rank)) 
p<- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p<- p + opts(title = "People Liked the Matrix Way Too Much")
matrix.label<-data.frame(Year = 1985 , Rank = 220, Text = "Matrix Released - 1999") # create the custom on graphic text label
p <- p +  geom_rect(aes(xmin = 1998 , xmax = 2000 , ymin = 1000 , ymax = 1 ),fill = "Green", alpha = .002)
p <- p + geom_text(data = matrix.label, aes(label = Text))
p

You can see here that some people loved the Matrix enough to brand their child for life. As you can see this rapid rise in popularity in a one or two year period can more easily be attributed to a particular social phenomena. Let take a first stab at this. What are the female names that have the greatest change in our data.

#This would be more elequently done with Hadley Wickam's ddply. The lapply/do.call("rbind") combo is brillinatly useful and for simple things I use
name.min.max<-function(nm){
data.frame(
  name = nm,
  min = min(names[which(names$Female == nm),2]),
  max = max(names[which(names$Female == nm),2]),
  dif = max(names[which(names$Female == nm),2]) - min(names[which(names$Female == nm),2])
  )
}
name.list<-unique(names$Female)
out<-lapply(X = as.list(name.list), FUN = name.min.max)
out <- do.call("rbind", out)
female.dif<-out[order(-out$dif),]
#Top Female Names with greatest change overtime # Need to seperate the winners and losser just plots largest difference
nm<-as.character(female.dif[1:10,1])

What we have done here is built a simple name.min.max function that returns a data frame of for a female name passed in. the lapply() function iterates over a list of distinct Female Names in name.list and returns a list of data frames(one fr each name). do.call(“rbind”,out) stitches them all back together. Then we order by the greatest difference a grab the top ten.

> female.dif[1:10,]
name min max dif 469 Ava 4 999 995 1232 Samantha 3 998 995 915 Tammy 8 994 986 59 Debra 2 982 980 29 Cheryl 13 992 979 1257 Kayla 11 988 977 38 Kathy 14 990 976
 22 Janice 21 996 975 738 Lily 18 993 975 1844 Taylor 6 978 972

Here is the result lets graph it with a facet for each name.

p<-ggplot(names,aes(x=Year,y=Rank)) 
p <- p + ylim(max(names$Rank),min(names$Rank)) 
p <- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p <- p + opts(title = "Top Movers")
p <- p + facet_wrap(~Female)
p

We definitely found names that changed over time but we have names that are rising over time and falling. Also our new name.min.max function doesn’t give us any insight about rate of change. Is it a slow change like Lily or a dramatic rise like Kayla. As we said before names with a dramatic rise over a short period are more likely associated with something in the pop culture of the time.

What we are most interested is the names that jumped the most over a one year period. I tried a couple different techniques to get this and im going to walk through what i think to be the best. I would love to see other people’s approach. To calculated this we are going to use the function Delt() from the package quantmod. Quantmod is a package for testing financial models. If you actually want to make god money with R, stop reading about baby names and go build a trading platform. Delt is used to calculate change in stock prices over time. i.e. 2% up since yesterday. So load up quantmod and see what happens.

library(quantmod)
female.names<-names[,c(1,4,2)]
female.names<-female.names[order(female.names$Female, female.names$Year),]
female.names$delta <- Delt(female.names$Rank)
female.names[1540:1580,]      

      Year Female Rank Delt.1.arithmetic
44018 1994 Alexis   18      -0.333333333
45014 1995 Alexis   14      -0.222222222
46008 1996 Alexis    8      -0.428571429
47008 1997 Alexis    8       0.000000000
48006 1998 Alexis    6      -0.250000000
49003 1999 Alexis    3      -0.500000000
50006 2000 Alexis    6       1.000000000
51005 2001 Alexis    5      -0.166666667
52005 2002 Alexis    5       0.000000000
53007 2003 Alexis    7       0.400000000
54011 2004 Alexis   11       0.571428571
55013 2005 Alexis   13       0.181818182
56014 2006 Alexis   14       0.076923077
57019 2007 Alexis   19       0.357142857
58015 2008 Alexis   15      -0.210526316
59013 2009 Alexis   13      -0.133333333
41952 1991 Alexus  952      72.230769231
42616 1992 Alexus  616      -0.352941176
43354 1993 Alexus  354      -0.425324675
44250 1994 Alexus  250      -0.293785311
45234 1995 Alexus  234      -0.064000000
46178 1996 Alexus  178      -0.239316239
47182 1997 Alexus  182       0.022471910
48184 1998 Alexus  184       0.010989011
49228 1999 Alexus  228       0.239130435
50252 2000 Alexus  252       0.105263158
51281 2001 Alexus  281       0.115079365
52317 2002 Alexus  317       0.128113879
53370 2003 Alexus  370       0.167192429
54425 2004 Alexus  425       0.148648649
55529 2005 Alexus  529       0.244705882
56547 2006 Alexus  547       0.034026465
57627 2007 Alexus  627       0.146252285
58719 2008 Alexus  719       0.146730463
59895 2009 Alexus  895       0.244784423
48992 1998 Alexys  992       0.108379888
49821 1999 Alexys  821      -0.172379032
50842 2000 Alexys  842       0.025578563
51837 2001 Alexys  837      -0.005938242
52909 2002 Alexys  909       0.086021505
53958 2003 Alexys  958       0.053905391

We subsetted the female names into their own object, ordered by name and year(this is important when calculating a rolling change). We opened up a slice of the new object and there are some problems. First is Delt() function is returning arithmetic differences(that is the difference in the two observation divided by the first). This would be a great help if the numbers it was calculating were truly continuous. Remember, these are ordinal rankings.

1998 Alexis    6      -0.250000000
1999 Alexis    3      -0.500000000

This change in Alexis ranking between 1998 and 1999 is being a calculated as a 50% increase. This isn’t strictly meaningful because what we want to calculate is the absolute change. This brings up my second favorite part of R. If you have a problem with how function is behaving, just go change the function.

we can download the source file for quantmod:

wget http://cran.r-project.org/src/contrib/quantmod_0.3-17.tar.gz

extract the tar file and grep the function you care about:

tar -zxvf quantmod_0.3-17.tar.gz

grep Delt ./quantmod/R/*

We get a R source file named ./quantmod/R/OHLC.transformations.R that contains the function we care about. Here is the original definition:

`Delt` <-
function(x1,x2=NULL,k=0,type=c('arithmetic','log'))
{
    x1 <- try.xts(x1, error=FALSE)
    type <- match.arg(type[1],c('log','arithmetic'))
    if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length');
    if(is.null(x2)){
        x2 <- x1 #copy for same symbol deltas
        if(length(k) < 2) {
            k <- max(1,k)
        }
    }
    dim(x2) <- NULL  # allow for multiple k matrix math to happen
    if(type=='log') {
        xx <- lapply(k, function(K.) {
                log(unclass(x2)/Lag(x1,K.))
              })
    } else {
        xx <- lapply(k, function(K.) {
                unclass(x2)/Lag(x1,K.)-1
              })
    }
    xx <- do.call("cbind", xx)
    colnames(xx) <- paste("Delt",k,type,sep=".")
    reclass(xx,x1)
}

There is a default type parameter that if not set will apply to the row.

unclass(x2)/Lag(x1,K.)-1

This is how the change is normalize. what we want is:

unclass(x2) - Lag(x1,K.)  

What we can do is add another type option(absolute) and add an else if to handle it. We can run this to define a new function and apply it to the data:

`Delt.Absolute` <-
function(x1,x2=NULL,k=0,type=c('arithmetic','log'))
{
    x1 <- try.xts(x1, error=FALSE)
    type <- match.arg(type[1],c('log','arithmetic', 'absolute'))
    if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length');
    if(is.null(x2)){
        x2 <- x1 #copy for same symbol deltas
        if(length(k) < 2) {
            k <- max(1,k)
        }
    }
    dim(x2) <- NULL  # allow for multiple k matrix math to happen
    if(type=='log') {
        xx <- lapply(k, function(K.) {
                log(unclass(x2)/Lag(x1,K.))
              })
    } else if (type=='absolute') {
        xx <- lapply(k, function(K.) {
                unclass(x2) - Lag(x1,K.)  
              })
    } else {
       xx <- lapply(k, function(K.) {
                unclass(x2)/Lag(x1,K.)-1
              })
 
    }
    xx <- do.call("cbind", xx)
    colnames(xx) <- paste("Delt",k,type,sep=".")
    reclass(xx,x1)
}
 
female.names<-names[,c(1,4,2)]
female.names<-female.names[order(female.names$Female, female.names$Year),]
female.names$delta <- Delt.Absolute(female.names$Rank,k=1, type = "absolute")
female.names[20:30,]
      Year  Female Rank Delt.1.absolute
28913 1978   Aaron  913             -33
29990 1979   Aaron  990              77
30883 1980   Aaron  883            -107
31956 1981   Aaron  956              73
32969 1982   Aaron  969              13
33881 1983   Aaron  881             -88
49978 1999 Abagail  978              97
50954 2000 Abagail  954             -24
51920 2001 Abagail  920             -34
52882 2002 Abagail  882             -38
53868 2003 Abagail  868             -14

now the function returns an absolute change. We still have an issue. look at the difference calculated between Aaron and Abagail. It is calculating between names, not good. To solve im going to use a trick I read on SO.

female.names[c(TRUE, female.names$Female[-1] != female.names$Female[-length(female.names$Female)]), 4] <- NA
female.names<-female.names[order( female.names[4]),]
 

This elegant line of code take the first instance of all the names in our data frame and replaces the Delt col with a NA(R’s missing data handler). There is some tweaking if you want to use it on numeric data but it is a clever piece of code. After we reorder and extract the top ten names with greatest one year positive change(which is actually a negative number from our Delt,Absolute function, think a change in Rank 10 -> 2 is actually -8 change). Lets plot these top ten.

female.names.winners<-female.names[order(female.names[4]),]    
top.female.names<-unique(female.names.winners[1:12,2])
p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank)) + geom_line(data = names[which(names$Female %in% top.female.names),], aes(group=Female, colour = Female), alpha = 1, size = 1)+ opts(title = "Female Baby Name Popularity Since 1950")+ opts(axis.text.x=theme_text(angle=-70),hjust=0) + facet_wrap(~ Female)
p

As you can see there are some names that flash up but do not stick around. However Some names shoot up and stayed popular. Take Desiree for example. Lets take a closer look.

p<-ggplot(names,aes(x=Year,y=Rank)) 
p <- p + ylim(max(names$Rank),min(names$Rank)) # Flip the Y-Axis
nm <- "Desiree" # enter a female name
p <- p  + geom_line(data = names[which(names$Female == nm),], aes(group=Female, colour = Female), alpha = 1, size = 2) + opts(title = nm)
p

Now I have never seen this movie but clearly Brando has an effect. Desiree (1954)

I hope you found some of this useful. There are some fun association that can be found. Have fun. Next Up fun with Baby Names Part 3: Biggest Losser’s

Popular Baby Names Walk-Through Part 1 - Web Scrapping and ggploting

This is the first walk-through I have posted. Reading these types of posts has been incredibly helpful as I have been learning R and other useful tools in the Unix universe. Hopefully you find it helpful.

First, I have been watching Google Python Videos the last couple days and they have a coding assignment using Social Security Administration Data Baby Names. Not having the downloads for the course I thought it would be a good python exercise to try to get the same data. So, my interest in baby names has nothing to do with any impending life decision(or any recent drunken decisions). You can get the python script and the R we are going to use here. Also the csv file we are going to use can also be downloaded here. Also if you are interested in more baby name projects and a web scrapper written in R/Ruby check out Hadley Wickham’s project.

now on to the R.

First We load the data and ggplot

library(ggplot2)
 
#load
names<-read.csv("top_names_since_1950.csv", header = TRUE)
#switch factors to characters
names[,4]<-as.character(names$Female)
names[,3]<-as.character(names$Male)
 
#take a quick look
head(names)

This data runs from 1950 to 2009, provides the year, rank and male and female name. Lets take a quick plot.

p<-ggplot(names,aes(x=Year,y=Rank)) +geom_line(aes(group=Female),colour="#431600",alpha=0.1)
p

We have to use a low alpha or the over-plotting gets to be much. Well it looks pretty but there are a couple issues. First ggplot’s defaults are treating the Rank variable like a number and have the top ranked names at the bottom of the Y-axis. second this doesn’t give you too much insight, its a little much. Lets focus flip the Y-axis and focus on a single name.

Here we get to explore one of the best parts of R in general and ggplot in particular, iterative coding. You write something, run, write some more, run. Here we can just add layers to the ggplot object we just created.

p <- p + ylim(max(names$Rank),min(names$Rank)) # Flip the Y-Axis
nm <- "Beth" # enter a female name
p <- p  + geom_line(data = names[which(names$Female == nm),], aes(group=Female, colour = Female), alpha = 1, size = 2) + opts(title = nm)
p

We can see that Beth is not as popular as i once was. We can try the same thing with a group of names. Let look at some Male names together. I have some brothers, so we can tune this into a competitive contest of nomenclature.

###MALE####
nm<-c("Malcolm","Ethan","Allen","John","Eric")
p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank))
p <- p + geom_line(data = names[which(names$Male %in% nm),], aes(group=Male, colour = Male), alpha = 1, size = 1)
p <- p + geom_line(aes(group=Male),colour="#431600",alpha=0.1)+ opts(title = "Male Baby Name Popularity Since 1950")
p

Comparing the relative fortunes of these names is informative but the chart has two main shortcomings. Charting all ~2500 names gives an interesting pattern and texture but it doesn’t obliviously add to the understanding of the underlying data. At best it is an interesting background, at worst it distracting chart junk. While I don’t want to sound like a fanboy Tufte Evangelist but this borders on chartjunk. We can print it with the background nonsense.

nm<-c("Malcolm","Ethan","John","Eric")
p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank))
p <- p + geom_line(data = names[which(names$Male %in% nm),], aes(group=Male, colour = Male), alpha = 1, size = 1)
p <- p + opts(title = "Male Baby Name Popularity Since 1950")
p

This is clearer but there is still a legend lookup issue. It’s unclear which names go with which line. ggplot’s facet_wrap is great for this.

p <- p + facet_wrap(~Male)
p

It is clear that the names with greatest range of ranks are a more interesting subjects. Even though I think that the regal name John is fine subject of study, continued dominance in the name game can get tiresome. Looking into the names with the great rise or fall an interesting exercise. That will be part 2 of this rambling visual diarrhea.