Command-Line Worldview - John Dennison

Month

October 2012

3 posts

vim regex printing repeating patterns to build update statements

So in the process of have to write hairy etl scripts I found myself have to write large update statements setting an old record with every field from a new record. This lead to long cookie cutter sql where you end up slitting your own wrists and that make nobody happy. I share this with the hopes that it might hlp someone and more importantly so i remember.

a.field1 = b.field1,

a.field2 = b.field2,

…

The trick i found was to use string replacements in VI. first you get the ddl/description of the table you are updating 

birthdate              | date                  | 

startdate          | date                  | 

enddate          | date                  | 

field1   | character varying(15) | 

field2 | character varying(15) | 

use :%s/[ ]*|.*//g

to get the field name

birthdate


startdate


enddate


field1


field2

next use :%s/^\(.*\)$/a.\1 = b.\1,/g 

to get:

a.birthdate = b.birthdate,

a.startdate = b.startdate,

a.enddate = b.enddate,

a.field1 = b.field1,

a.field2 = b.field2,

Boom. Now you have your update. 

Oct 24, 20123 notes
#vi #vim #regex #sql
Cut and Paste Images into GWT/Chrome

Beyond the statistical computing work that keeps me busy most days, I have spent the last year dipping my toes into web development. Having some experience with python, when the opportunity came up to rewrite our horrible sharepoint site, I jumped on it. Before any one could pipe up we had a django site that has grown to incorporate our entire intranet.

However, our main external facing app remains a GWT/EXT stack and this has proven to be a much steeper hill to climb. For those who do not know, GWT is beautiful piece of engineering, that translates complex applications built in Java into optimized javascript that runs on any browser (IE6+). If that sounds like a terrible idea to you, you’re not alone. If you want to build web applications, use a web language. However, I will not wade into that thicket now. Despite some’s philosophical apprehensions, if you need to build full blown web apps on IE6+ GWT has proven to be a powerful framework.

A feature request came up where we needed to have users upload screen shots to our servers and retrieve them at a future date. Having users take a screen shot, save to file and upload through a simple file upload to server java servlet is a simple, straight forward approach. However, us nerds fetishize complexity. While using gmail I stumbled upon the in-email cut/paste of images directly in the browser. While most users probably didn’t even notice this feature if you write stuff for a living you ogle in jealousy. I GOTS TO GET ME SOME OF THAT.

GWT has no built in ability to handle image pasting from the clip board. However, one of the beautiful mottos of GWT is “when you can not, you JSNI.” Because all your client Java code is mangled into javascript, GWT allows you to write javascript methods that can be called from java code. Here are a couple of snippets i came up with to implement image pasting in GWT. Note this relies on Chrome’s file reader api so it is NOT a cross browser solution and certainly will not work in IE anything. So enough rambling, here is the code.

The core of the approach consists of two functions. The first implements the paste listener: 

https://gist.github.com/3876362

This is an example of a JSNI method. The special /*-{ notation tricks Eclipses into thinking its a comment. One more special note are the $wnd and $doc variables. These are special GWT global variables that allow you to access the window and document objects that all javascript. for example if you had jQuery loaded in the app you would have to use $wnd.$(‘#idstring’)

The javascript is a horrible tangled mess of closures because the FileReader api is async. You define the .onloadend method that is executed when you call the  reader.readAsDataURL(blob) method. This honestly was the most complicated aspect of the feature. async javascript makes my mind hurt. Just make sure you execute the method when you initialize your page.

Inside of the onloadend closure we make a call to a java function that handles the placement of the picture on the page. I have the image widget already placed on my page. Here we are creating a brand new one:

https://gist.github.com/3876583

In the reader.onloadend function you have the base64 encoding, which is the ancii representation of the png file you passed to the java method onPaste. Here are the relevant docs:

http://www.youtube.com/watch?v=3vAnuBtyEYE

This a clever solution which allows you to embed images directly into html. To get the image data all you need to do is call the Image.getURL() method. Be warned when storing this representation back on the server base64 is usually 30% larger then the binary equivalent. We ended up shipping the base64 string to the server through RPC calls but decoding it and storing it as a bytea field in postgres. From there we built a custom servlet that reads the picture out of the database and ships it to the web page. Although the pasting can only occur in Chrome, the image can be read from any browser. Yes even IE6,7 (which cannot render base64).

One more note, be sure to crop off the leading encoding string in the image URL. It will look like: <img src=”data:image/png;base64,iVBORw0KGg” /> make sure you drop the “data:image/png;base64,” part before you store as a binary file. 

Hope this helps. Cheers

Oct 11, 2012
#GWT #Javascript #PNG #Web Development
Code Test

I am trying out a new code highlighter:


puts "Hello"
puts "World"
Oct 11, 2012

September 2012

1 post

Simple Parallel randomForest with doMC package

I have been exploring how to speed up some of my R scripts and have started reading about some amazing corners of R. My first weapon was the Rcpp and RcppArmadillo package. These are wonderful tools and even for someone that has never written c++ before, there are enough to examples and documentation to get started. I hope to share some of these approaches soon. The second and the topic of this missive is parallelization through the package doMC.

When answering a question on stats.stackoverflow, I got to share a simple use case with the randomForest library. I wanted to share that here as well. First a little background on why this works. randomForest is an ensemble technique that trains a group of decision trees on random subset of a training set( a group of trees, forest get it). A new record is presented to this group and what ever class the preponderance of underlying decision trees choose is the classification. For this example, one of the most important aspects of this algorithm is that each decision tree is independently trained on a random subset of variables and records. When in the parallel mindset independent is next godliness, what it means is that you can spread the training of a forest over multiple cores/machines. Below is a simple one machine example.

library("doMC")
library("randomForest")
data(iris)
 
registerDoMC(4) #number of cores on the machine
darkAndScaryForest <- foreach(y=seq(10), .combine=combine ) %dopar% {
   set.seed(y) # not really needed
   rf <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
}

I passed the randomForest combine function(which will stitch independently trained forests together)  to the similarly named .combine parameter( which controls the function on the output of the loop. The down side is you get no OOB error rate or more tragically variable importance. But this will speed up your training. 

Re-approaching problems with parallelization in mind is a tough problem but i have seen some very real improvements in my speed by doing so. 

Sep 17, 20122 notes
#rstats #R #randomForest #doMC #parrallel #parallel computation

August 2012

1 post

Sampling in Postgres w/ random()

Here is a simple trick for sampling in Postgres. I seem to find a new corners of Postgres every day.

say you have a list of accounts/users and you want a random subset:

select account_number

from customers

where random() > .1

limit 10000;

simple but effective.

Aug 10, 2012
#sql #coding #sampling

June 2012

1 post

R/Python Web Apps

I have a little delinquent on this whole blogging thing but here is a talk I gave to the DC R Group. On a twisted and Rpy2 web application framework that I built for my company. Enjoy

http://bit.ly/NW0Neg

J

Jun 10, 2012
#R #Python #Web Development #Rpy2

January 2012

1 post

Long running R commands: unix screen, nohup and R

I wanted to write a quick post about a useful linux tool for using R. I sometimes have long running R sessions for model training. One solution that I have alluded to is to call your script with Rscript.

The nohup command in linux with push this to the back ground. Even adding mutt(or other email program) to the bottom of your script is extremely useful.

You can save the details of the training into output.txt and then place this command in your R file

system(mutt -s “Master, this is an email from HAL 9000, your commands have been completed” john@foo.com < ~/R/output.txt)

Then you can call the script with nohup Rscript foo.R &

The nohup command runs a command in the background and the ampersand is unix for return the prompt. This will redirect all standard output to a file nohup.out file placed in the directory you called it. However, if you have multiple R file running and one breaks the error gets lost in the common log. Total mess, trust me. The solution is to redirect the output log to individual files with this little thing:

nohup Rscript foo.R > ~/R/logs/foo.out 2>&1 &

This redirects the standard out to a file ~/R/logs/foo.out as well as place all standard error there as well. that is what: 2>&1 is doing.

This is all well and good but we know that most R is written interactively not in nifty self contained scripts. To do this linux has a great program call screen. I and not going to get too fancy with screen but a it has been extremely useful to me. Again i am sshing to remote reserved servers or EC2 instances but this is also useful on a home machine for kicking of an R command and going back to what ever you do on you machine. (ANSII porn or http://xkcd.com/981/). The command is:

screen -S R

This will open a seemingly identical terminal prompt. From here use R(type R…) as you would. When you hit a long running process. Hit Ctrl-a d. This is the screen command to access all screen commands. Ctrl-a d will bring you back to your home prompt. Ctrl-a then typing :quit will exit the screen. When you want to get back, you can type:

screen

Now the -S R options you typed before are optional and if you only have one screen open. I have many screens running (mainly interactive sessions to my databases or ssh to EC2 machines) and getting back to them is a pain unless you name the screens the -S FOO will open a screen called FOO. To reopen type:

screen -Dr FOO

or in our case:

screen -Dr R

The D arg will detach the screen if you exited in a less then graceful manner.

One more trick: if you paste the R train command and the above system command email system() as one text block in the R screen session, then detach you will receive a lovely email when you are done.

Good Luck hacking this the New Year!!!

John

Jan 4, 201286 notes
#R #linux #screen

December 2011

5 posts

Lehmann Primiality Test in R

After a small frustration with the limitations of R 16-bit intergers. I think i have worked out a solution. The issue was i was implementing a bone-headed solution to the modulo/exponent step. So to recap. The Lehmann primality test is a probabilistic prime test. I got the implementation from this article http://davidkendal.net/articles/2011/12/lehmann-primality-test

the basic logic is to take n

1. get a random number a = between 1 and n-1

2. calculate x =(a^(n-1)/2) %% n

3. if x == 0 or x == (-1 %% n) then the number could be prime

The issue happened when calculating x =(a^(n-1)/2) %% n. For even n < 100 the number in this function gets big. So big that the limitation on R integer lengths rounds the number and then the modulo function is meaningless. The way to fix this is to use modular exponentiation. (Thanks to stackoverflow  for bringing up the solution). This allows us to iteratively multiply the  base by the exponent and store the modulo for every step. By doing this we never have to deal with a number so large as to screw with the R limitations.There are some clever highly efficient modular exponentiation algorithms but i am no computer scientist so we will use the simplest(modular exponentiation with modular multiplication)

Here is the full solution:

modexp<-function(a, b, n){
    r = 1
    for (i in 1:b){
        r = (r*a) %% n
    }
    return(r)
}
 
 
primeTest <- function(n, iter){
   a <- sample(1:(n-1), 1)
    lehmannTest <- function(y, tries){
      x <- modexp(y, (n-1)/2, n)   
    if (tries == 0) {
      return(TRUE)
            }else{          
      if ((x == 1) | (x == (-1 %% n))){
        lehmannTest(sample(1:(n-1), 1), (tries-1))
        }else{
        return(FALSE)
      }
    }
  }
   if( n < 2 ){
     return(FALSE)
     }else if (n ==2) {
       return(TRUE)
       } else{
         lehmannTest(a, iter)
         }
}

This type of iterative exponentiation is implemented in Python under pow() so a python implementation is quite easy to build as well. Enjoy. R can get frustrating but a little searching around can get you a far way.

Cheers

J

Dec 21, 20113 notes
#R #prime numbers #primality
I may have ventured into the first circle and Virgil Is NOT Here -- Lehmann Primality Test in R

So a post about Lehmann Primality Tests in HackerNews came across my twitter this morning. Seemed like a great quick coding exercise during lunch. Little Did I know…It is wonderfully simple primality test but i think i have hit up against some R quirks. Here is my code for the test (thanks to David Kenal at the aforementioned link of the implementation in Scheme, this is very much his logic):

the basic logic is to take n

1. get a random number a = between 1 and n-1

2. calculate x =(a^(n-1)/2) %% n

3. if x == 0 or x == (-1 %% n) then the number could be prime

The more you repeat these three steps the more certain you are. So you call it recursively. So the function takes to arguments the number is question and the number of iterations(higher the more certain). This implementation seemed to work.

However… I started testing larger primes and things go terribly wrong. you get incorrect an answers and an ominous sounding message:

Warning message:
probable complete loss of accuracy in modulus 

I believe the issue comes down to floating point conversions. AKA the first circle of the R inferno. (which is a great read and for a dante fan and a nerd a wonderful collisions of worlds)

say for n<-97

> (y^((n-1)/2))
[1] 2.730792e+64

That is a big Number thanking the mod of it doesn’t go so well. above a certain point the mod function breaks down because of floating point rounding. I don’t know if there is an easy solution but i wanted to share my experiences as and if you are “Midway on our life’s journey, I found myself in dark woods, the right road lost” always look to R inferno for help.

Here is a loop to test:

prime_test<-c(2,3,5,7,11,13,17 ,19,23,29,31,37)

for (i in 1:length(prime_test)) {
  print(primeTest(prime_test[i], 50))
}

Dec 20, 20111 note
#R #prime numbers #Rinferno #Dante #lehmann primality test
Dec 16, 2011
Poor, Poor Hillary

This will be the last baby name related post but this came out of part two web scrapping post last month. I was looking for the fastest rising Names. I flip the logic and looked for the fastest declining names in relative popularity.

Out of that exercise this gem fell out.

Here is the messy ggplot code:

nm<-"Hillary"
p<-ggplot(names[which(names$Female %in% nm),],aes(x=Year,y=Rank)) 
p<- p + ylim(max(names$Rank),min(names$Rank)) 
p<- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p<- p + opts(title = "Poor, Poor Hillary: Popularity of the Female Name 'Hillary' \n But we already that the Clinton Prseidency was better to certain women")
matrix.label<-data.frame(Year = 1996 , Rank = 100, Text = "Clinton \n Presidency") # create the custom on graphic text label, kind of a hack
p <- p +  geom_rect(aes(xmin = 1992 , xmax = 2001 , ymin = 1000 , ymax = 1 ),fill = "Blue", alpha = .002) 
p <- p +  geom_rect(aes(xmin = 2007 , xmax = 2008 , ymin = 1000 , ymax = 1 ),fill = "Green", alpha = .002) 
matrix.label2<-data.frame(Year = 2007 , Rank = 100, Text = "Hillary's \n Presidential \n Run")
 
# 1993 ->2001 the years of the Clinton Presdidency but 1992 was when he won the election so that is our starting point
p <- p + geom_text(data = matrix.label, aes(label = Text, size =3))
p <- p + geom_text(data = matrix.label2, aes(label = Text, size = 3))
p

Created by Pretty R at inside-R.org

For the Data look at my github project https://github.com/jofusa/ssa-baby-names

Dec 16, 20119 notes
#R #ggplot #politics #clinton #hillary #presidency
SVN Version Control, R, and some rambling thought on AWS,Rscripts

I do a alot of my modelling on Rstudio hosted on EC2 instances. If you don’t use, I would highly recommend. A brilliant tool. Kudos to the Rstudio team. I have made a personal and professional pledge to obsessively use version control. I hope to show a quick example of how to use version control in the modelling context, even if you are not tweaking the linux kernel. I know Rstudio has version control as an enhancement very soon and i am eagerly awaiting its release, like a virgin on prom night. (perhaps given the nerdy crowd that could bring up painful memories).

The first and obvious use of version control is to keep track of scripts over time. This is exactly what it is designed to do. Even if you came to R as a non-programmer(like me) it good to adopt the best parts. If you are at all familiar with version control this should make plenty of sense. If you are not here are some better high level and tutorials better then i could write:

http://chronicle.com/blogs/profhacker/a-gentle-introduction-to-version-control/23064

More technical introduction svn:

http://svnbook.red-bean.com/

Great Git intro(I professinoally use SVN and just started Git, the jump can be a little steep):

http://library.edgecase.com/git_immersion/index.html

Now to why you are reading this article, the R. So you crunch an afternoon, add some clever backflip to your R project, you should save it, then flip over to the command line and commit to you repo. This hopefully will get even easier with aforementioned Rstudio enhancement. However what about your models, the actual R objects. If you are just exploring and hacking a data set then the longevity of your models are not too important, you got the insight and you keep modeling. However, when you are constantly tweaking and comparing similar models then tracking these changes is incredibly important.

library(randomForest)
seed<-123
set.seed(seed)
 
data(iris) #Load Data
 
start<-Sys.time()
iris_rf <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
end<-Sys.time()
time<-(end-start)
 
#Save the model
save(iris_rf, file ="~/R/model/iris_rf")
 
#Now we add the the SVN Repo
system("svn add ~/R/model/iris_rf")
 
#Build Comment and Commit
svn.comment <- paste("RandomForest for Iris Data with seed of: ",seed," and run time of ", time," Secs", sep ="")
 
eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 
#Now get SVN Revision Number     
model_ver<-system("svn info ~/R/models/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE) 
#This awk command grabs the version number, 
#the 'intern =TRUE' redirects the output of the system command to your object instead of stout
#a brilliant useful tool on a linux system
print(model_ver)
#Now we have our revision number

This is clearly a toy example but it illustrates the use. Because I do a lot of database development, something is not useful to me until it stored in a database. That depends on what you are going to use them models for but I like to store out of sample error rates, cross validation results, the specific parameters used in each model run and of course the revision number of the SVN repo. That ties it all together. Now for a further example say we want to tweak the example.

args1 <- commandArgs(TRUE)[1]
library(randomForest)
seed<-args1
set.seed(seed)
 
system("svn update ~/R/model/")
load("~/R/model/iris_rf")
 
data(iris) #Load Data
 
start<-Sys.time()
iris_rf2 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
end<-Sys.time()
time<-(end-start)
 
#we are combining the old model with the new model to add trees to the forest
rf.all <- combine(iris_rf,iris_rf2)
 
#Save the model
save(rf.all, file ="~/R/model/iris_rf")
 
#Now we add the the SVN Repo
system("svn add ~/R/model/iris_rf")
 
#Build Comment and Commit,
svn.comment <- paste("RandomForest for Iris Data with seed of: ",seed," and run time of ", time," Secs", sep ="")
 
eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 
#Now get SVN Revision Number     
model_ver<-system("svn info ~/R/models/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE) 

I have added a little tweak to the top of the this script the really opens up the power of this approach.

commandArgs(TRUE)[1]

This functions parses command line arguments so they can be used within your R script. Say we saved this file as rf_Iris.R. you could call it from the command line or from within a shell script(where this REALLY gets fun, ok maybe i have a tainted idea of whats fun) like:

Rscript ./rf_Iris.R 456

Now this example of resetting the seed is a little toy, however you could pass in the target variable, model name for rerunning/updating, parametrize a sql call. It get cool i promise you. It turns R from a interactive scripting tool to a batch process. I will post some really cool shell scripts that allow you to spin up EC2 machines to remotely execute Rscripts on the cloud. Combining the randomForest Combine technique used above and a couple EC2 machines you can build a distributed training ‘cluster’. Or more accurately described by Mike Driscoll as “Bash Reduce”. At minimum you can expand your computing power from the comfort of your own command line. (as well as really useful story to pick up girls at the bar. little know fact, talk of distributed cloud based modelling really gets the women).

This framework can be used to dynamically build R statements using command-line arguments. I also like the three line split because you can just run the paste command to see what would run. It is totally kludgey the more complicated the statement it can really get difficult to debug. If someone has a more elegant solution i would love to hear.

eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 

A bit of WARNING: this type of concatenate command building should get your DBA/SysAdmin alarms going off. You should not expose any script that is this hack to web server or let i run by a potentially malicious soul on your machine. It is just asking for a sting injection attack. If you planning to access these scripts through a rApache or make them public facing, you should NOT be concatenating system commands. If you are skilled enough to do that, then you probably don’t need me to tell you.

Cheers

Dec 16, 201126 notes
#R #SVN #Version #Git #Modelling #Linux #Command-Line #AWS #EC2 #Rstudio

November 2011

5 posts

Ulam Spirals in R and ggplot

Having seen a twitter post speed by about Ulam Spirals I started to read up.  As the story goes in 1963 Stanislaw Ulam was bored at conference and started scribbling numbers in a spiral. What he discovered was a strange diaginal pattern of Prime Numbers. I was a little bored as well, so i decided to write up some R to plot these patterns.

Here is the main function. I am sure there are more elegant ways of calculating a Ulam spiral but here is my hack. First we calculate the diagonals and then fill in the gaps.

 
Ulam.Spiral<-function(N){
  if (even(N)){
     cat(sprintf("Error: function only accepts odd integers because it a poorly written and fragile piece of code.\n"))
  }else{
m <- matrix(NA, nrow=(N), ncol=N)
    top.left<-c(1,1)
    bottom.right<-c(N,N)
    top.right<-c(1,N)
    bottom.left<-c(N,1)
    n<-N
    a<-N
    m[median(1:N),median(1:N)]<-1
 
while(a>=3){
# This is an adaptation of a Euler Problem 28 solution. It calculates the diaginals.
    m[bottom.right[1],bottom.right[2]]<- a^2 #bottom right
    m[top.left[1],top.left[2]]<- a^2 - 2*a + 2 # top left
    m[top.right[1],top.right[2]]<- a^2 - 3*a + 3 # top right
    m[bottom.left[1],bottom.left[2]]<- a^2 - a + 1 #bottom left
 
#Both Horizontal Rows
    m[top.left[1],(top.left[2]+1):(top.right[2]-1)]<-seq( m[top.left[1],top.left[2]]-1,m[top.left[1],top.right[2]]+1) #Fill in top row
    m[bottom.left[1],(bottom.left[2]+1):(bottom.right[2]-1)]<-seq(m[bottom.left[1],bottom.left[2]]+1, m[bottom.left[1],bottom.right[2]]-1) #Fill in bottom row
#Left Vertical Rows
    m[(top.left[1]+1):(bottom.left[1]-1), top.left[1]] <- seq(m[top.left[1],top.left[1]]+1, m[bottom.left[1],top.left[1]]-1)#Left Hand Row
#Right Verical Row
   m[(top.right[1]+1):(bottom.right[2]-1),top.right[2]] <-seq(m[top.right[1],top.right[2]]-1 , m[top.right[1],top.right[2]] - (a-2) )
#drop down on square and repeat
    a<-a-2
    top.left<-c(top.left[1]+1,top.left[2]+1)
    bottom.right<-c(bottom.right[1]-1,bottom.right[2]-1)
    top.right<-c(top.right[1]+1,top.right[2]-1)
    bottom.left<-c(bottom.left[1]-1,bottom.left[2]+1)
}
return(m)
}
}

Created by Pretty R at inside-R.org

Here is the output for a 5X5 spiral

     [,1] [,2] [,3] [,4] [,5]
[1,]   17   16   15   14   13
[2,]   18    5    4    3   12
[3,]   19    6    1    2   11
[4,]   20    7    8    9   10
[5,]   21   22   23   24   25

The code is on git hub. https://github.com/jofusa/ulam-spirals-R

Many Thanks to http://librestats.wordpress.com/2011/08/20/prime-testing-function-in-r/ for a useful IsPrime() function and saving me some time.

Here is an example Plot- Prime Number are in red:

image

Nov 29, 20112 notes
#R #ggplot #Prime Numbers #Euler Problems #Ulam Spirals
Popular Baby Names Walk-Through Part 2 - Graphing the fast movers

I will assume you have read through part 1 and have the csv file loaded. While we covered some basic graphing in the last post i hope to get into a little more of the data crunching. Specifically I am interested in the names which where driven by a specific cultural phenomena. There are clearly decade long trends where “Beth” is fading in popularity, but you can not point to a specific event or figure in the public consciousness that is driving this. Enough talk let me show you what I mean. Here is the chart for the female name Trinity. Hat Tip to the google dev python people for pointing this out.

nm<-"Trinity"
p<-ggplot(names,aes(x=Year,y=Rank)) 
p<- p + ylim(max(names$Rank),min(names$Rank)) 
p<- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p<- p + opts(title = "People Liked the Matrix Way Too Much")
matrix.label<-data.frame(Year = 1985 , Rank = 220, Text = "Matrix Released - 1999") # create the custom on graphic text label
p <- p +  geom_rect(aes(xmin = 1998 , xmax = 2000 , ymin = 1000 , ymax = 1 ),fill = "Green", alpha = .002)
p <- p + geom_text(data = matrix.label, aes(label = Text))
p

image

You can see here that some people loved the Matrix enough to brand their child for life. As you can see this rapid rise in popularity in a one or two year period can more easily be attributed to a particular social phenomena. Let take a first stab at this. What are the female names that have the greatest change in our data.

#This would be more elequently done with Hadley Wickam's ddply. The lapply/do.call("rbind") combo is brillinatly useful and for simple things I use
name.min.max<-function(nm){
data.frame(
  name = nm,
  min = min(names[which(names$Female == nm),2]),
  max = max(names[which(names$Female == nm),2]),
  dif = max(names[which(names$Female == nm),2]) - min(names[which(names$Female == nm),2])
  )
}
name.list<-unique(names$Female)
out<-lapply(X = as.list(name.list), FUN = name.min.max)
out <- do.call("rbind", out)
female.dif<-out[order(-out$dif),]
#Top Female Names with greatest change overtime # Need to seperate the winners and losser just plots largest difference
nm<-as.character(female.dif[1:10,1])

What we have done here is built a simple name.min.max function that returns a data frame of for a female name passed in. the lapply() function iterates over a list of distinct Female Names in name.list and returns a list of data frames(one fr each name). do.call(“rbind”,out) stitches them all back together. Then we order by the greatest difference a grab the top ten.

> female.dif[1:10,]
         name min max dif
469       Ava   4 999 995
1232 Samantha   3 998 995
915     Tammy   8 994 986
59      Debra   2 982 980
29     Cheryl  13 992 979
1257    Kayla  11 988 977
38      Kathy  14 990 976
 22 Janice 21 996 975 738 Lily 18 993 975 1844 Taylor 6 978 972

Here is the result lets graph it with a facet for each name.

p<-ggplot(names,aes(x=Year,y=Rank)) 
p <- p + ylim(max(names$Rank),min(names$Rank)) 
p <- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p <- p + opts(title = "Top Movers")
p <- p + facet_wrap(~Female)
p

image

We definitely found names that changed over time but we have names that are rising over time and falling. Also our new name.min.max function doesn’t give us any insight about rate of change. Is it a slow change like Lily or a dramatic rise like Kayla. As we said before names with a dramatic rise over a short period are more likely associated with something in the pop culture of the time.

What we are most interested is the names that jumped the most over a one year period. I tried a couple different techniques to get this and im going to walk through what i think to be the best. I would love to see other people’s approach. To calculated this we are going to use the function Delt() from the package quantmod. Quantmod is a package for testing financial models. If you actually want to make god money with R, stop reading about baby names and go build a trading platform. Delt is used to calculate change in stock prices over time. i.e. 2% up since yesterday. So load up quantmod and see what happens.

library(quantmod)
female.names<-names[,c(1,4,2)]
female.names<-female.names[order(female.names$Female, female.names$Year),]
female.names$delta <- Delt(female.names$Rank)
female.names[1540:1580,]      


      Year Female Rank Delt.1.arithmetic
44018 1994 Alexis   18      -0.333333333
45014 1995 Alexis   14      -0.222222222
46008 1996 Alexis    8      -0.428571429
47008 1997 Alexis    8       0.000000000
48006 1998 Alexis    6      -0.250000000
49003 1999 Alexis    3      -0.500000000
50006 2000 Alexis    6       1.000000000
51005 2001 Alexis    5      -0.166666667
52005 2002 Alexis    5       0.000000000
53007 2003 Alexis    7       0.400000000
54011 2004 Alexis   11       0.571428571
55013 2005 Alexis   13       0.181818182
56014 2006 Alexis   14       0.076923077
57019 2007 Alexis   19       0.357142857
58015 2008 Alexis   15      -0.210526316
59013 2009 Alexis   13      -0.133333333
41952 1991 Alexus  952      72.230769231
42616 1992 Alexus  616      -0.352941176
43354 1993 Alexus  354      -0.425324675
44250 1994 Alexus  250      -0.293785311
45234 1995 Alexus  234      -0.064000000
46178 1996 Alexus  178      -0.239316239
47182 1997 Alexus  182       0.022471910
48184 1998 Alexus  184       0.010989011
49228 1999 Alexus  228       0.239130435
50252 2000 Alexus  252       0.105263158
51281 2001 Alexus  281       0.115079365
52317 2002 Alexus  317       0.128113879
53370 2003 Alexus  370       0.167192429
54425 2004 Alexus  425       0.148648649
55529 2005 Alexus  529       0.244705882
56547 2006 Alexus  547       0.034026465
57627 2007 Alexus  627       0.146252285
58719 2008 Alexus  719       0.146730463
59895 2009 Alexus  895       0.244784423
48992 1998 Alexys  992       0.108379888
49821 1999 Alexys  821      -0.172379032
50842 2000 Alexys  842       0.025578563
51837 2001 Alexys  837      -0.005938242
52909 2002 Alexys  909       0.086021505
53958 2003 Alexys  958       0.053905391

We subsetted the female names into their own object, ordered by name and year(this is important when calculating a rolling change). We opened up a slice of the new object and there are some problems. First is Delt() function is returning arithmetic differences(that is the difference in the two observation divided by the first). This would be a great help if the numbers it was calculating were truly continuous. Remember, these are ordinal rankings.

1998 Alexis    6      -0.250000000
1999 Alexis    3      -0.500000000

This change in Alexis ranking between 1998 and 1999 is being a calculated as a 50% increase. This isn’t strictly meaningful because what we want to calculate is the absolute change. This brings up my second favorite part of R. If you have a problem with how function is behaving, just go change the function.

we can download the source file for quantmod:

wget http://cran.r-project.org/src/contrib/quantmod_0.3-17.tar.gz

extract the tar file and grep the function you care about:

tar -zxvf quantmod_0.3-17.tar.gz

grep Delt ./quantmod/R/*

We get a R source file named ./quantmod/R/OHLC.transformations.R that contains the function we care about. Here is the original definition:

`Delt` <-
function(x1,x2=NULL,k=0,type=c('arithmetic','log'))
{
    x1 <- try.xts(x1, error=FALSE)
    type <- match.arg(type[1],c('log','arithmetic'))
    if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length');
    if(is.null(x2)){
        x2 <- x1 #copy for same symbol deltas
        if(length(k) < 2) {
            k <- max(1,k)
        }
    }
    dim(x2) <- NULL  # allow for multiple k matrix math to happen
    if(type=='log') {
        xx <- lapply(k, function(K.) {
                log(unclass(x2)/Lag(x1,K.))
              })
    } else {
        xx <- lapply(k, function(K.) {
                unclass(x2)/Lag(x1,K.)-1
              })
    }
    xx <- do.call("cbind", xx)
    colnames(xx) <- paste("Delt",k,type,sep=".")
    reclass(xx,x1)
}

There is a default type parameter that if not set will apply to the row.

unclass(x2)/Lag(x1,K.)-1

This is how the change is normalize. what we want is:

unclass(x2) - Lag(x1,K.)  

What we can do is add another type option(absolute) and add an else if to handle it. We can run this to define a new function and apply it to the data:

`Delt.Absolute` <-
function(x1,x2=NULL,k=0,type=c('arithmetic','log'))
{
    x1 <- try.xts(x1, error=FALSE)
    type <- match.arg(type[1],c('log','arithmetic', 'absolute'))
    if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length');
    if(is.null(x2)){
        x2 <- x1 #copy for same symbol deltas
        if(length(k) < 2) {
            k <- max(1,k)
        }
    }
    dim(x2) <- NULL  # allow for multiple k matrix math to happen
    if(type=='log') {
        xx <- lapply(k, function(K.) {
                log(unclass(x2)/Lag(x1,K.))
              })
    } else if (type=='absolute') {
        xx <- lapply(k, function(K.) {
                unclass(x2) - Lag(x1,K.)  
              })
    } else {
       xx <- lapply(k, function(K.) {
                unclass(x2)/Lag(x1,K.)-1
              })
 
    }
    xx <- do.call("cbind", xx)
    colnames(xx) <- paste("Delt",k,type,sep=".")
    reclass(xx,x1)
}
 
female.names<-names[,c(1,4,2)]
female.names<-female.names[order(female.names$Female, female.names$Year),]
female.names$delta <- Delt.Absolute(female.names$Rank,k=1, type = "absolute")
female.names[20:30,]

      Year  Female Rank Delt.1.absolute
28913 1978   Aaron  913             -33
29990 1979   Aaron  990              77
30883 1980   Aaron  883            -107
31956 1981   Aaron  956              73
32969 1982   Aaron  969              13
33881 1983   Aaron  881             -88
49978 1999 Abagail  978              97
50954 2000 Abagail  954             -24
51920 2001 Abagail  920             -34
52882 2002 Abagail  882             -38
53868 2003 Abagail  868             -14

now the function returns an absolute change. We still have an issue. look at the difference calculated between Aaron and Abagail. It is calculating between names, not good. To solve im going to use a trick I read on SO.

female.names[c(TRUE, female.names$Female[-1] != female.names$Female[-length(female.names$Female)]), 4] <- NA
female.names<-female.names[order( female.names[4]),]
 

This elegant line of code take the first instance of all the names in our data frame and replaces the Delt col with a NA(R’s missing data handler). There is some tweaking if you want to use it on numeric data but it is a clever piece of code. After we reorder and extract the top ten names with greatest one year positive change(which is actually a negative number from our Delt,Absolute function, think a change in Rank 10 -> 2 is actually -8 change). Lets plot these top ten.

female.names.winners<-female.names[order(female.names[4]),]    
top.female.names<-unique(female.names.winners[1:12,2])
p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank)) + geom_line(data = names[which(names$Female %in% top.female.names),], aes(group=Female, colour = Female), alpha = 1, size = 1)+ opts(title = "Female Baby Name Popularity Since 1950")+ opts(axis.text.x=theme_text(angle=-70),hjust=0) + facet_wrap(~ Female)
p

image

As you can see there are some names that flash up but do not stick around. However Some names shoot up and stayed popular. Take Desiree for example. Lets take a closer look.

p<-ggplot(names,aes(x=Year,y=Rank)) 
p <- p + ylim(max(names$Rank),min(names$Rank)) # Flip the Y-Axis
nm <- "Desiree" # enter a female name
p <- p  + geom_line(data = names[which(names$Female == nm),], aes(group=Female, colour = Female), alpha = 1, size = 2) + opts(title = nm)
p

image

Now I have never seen this movie but clearly Brando has an effect. Desiree (1954)

I hope you found some of this useful. There are some fun association that can be found. Have fun. Next Up fun with Baby Names Part 3: Biggest Losser’s

Nov 21, 20115 notes
#R #ggplot #quantmod #web scraping #python
Popular Baby Names Walk-Through Part 1 - Web Scrapping and ggploting

This is the first walk-through I have posted. Reading these types of posts has been incredibly helpful as I have been learning R and other useful tools in the Unix universe. Hopefully you find it helpful.

First, I have been watching Google Python Videos the last couple days and they have a coding assignment using Social Security Administration Data Baby Names. Not having the downloads for the course I thought it would be a good python exercise to try to get the same data. So, my interest in baby names has nothing to do with any impending life decision(or any recent drunken decisions). You can get the python script and the R we are going to use here. Also the csv file we are going to use can also be downloaded here. Also if you are interested in more baby name projects and a web scrapper written in R/Ruby check out Hadley Wickham’s project.

now on to the R.

First We load the data and ggplot

library(ggplot2)
 
#load
names<-read.csv("top_names_since_1950.csv", header = TRUE)
#switch factors to characters
names[,4]<-as.character(names$Female)
names[,3]<-as.character(names$Male)
 
#take a quick look
head(names)

This data runs from 1950 to 2009, provides the year, rank and male and female name. Lets take a quick plot.

p<-ggplot(names,aes(x=Year,y=Rank)) +geom_line(aes(group=Female),colour="#431600",alpha=0.1)
p

image

We have to use a low alpha or the over-plotting gets to be much. Well it looks pretty but there are a couple issues. First ggplot’s defaults are treating the Rank variable like a number and have the top ranked names at the bottom of the Y-axis. second this doesn’t give you too much insight, its a little much. Lets focus flip the Y-axis and focus on a single name.

Here we get to explore one of the best parts of R in general and ggplot in particular, iterative coding. You write something, run, write some more, run. Here we can just add layers to the ggplot object we just created.

p <- p + ylim(max(names$Rank),min(names$Rank)) # Flip the Y-Axis
nm <- "Beth" # enter a female name
p <- p  + geom_line(data = names[which(names$Female == nm),], aes(group=Female, colour = Female), alpha = 1, size = 2) + opts(title = nm)
p

image

We can see that Beth is not as popular as i once was. We can try the same thing with a group of names. Let look at some Male names together. I have some brothers, so we can tune this into a competitive contest of nomenclature.

###MALE####
nm<-c("Malcolm","Ethan","Allen","John","Eric")
p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank))
p <- p + geom_line(data = names[which(names$Male %in% nm),], aes(group=Male, colour = Male), alpha = 1, size = 1)
p <- p + geom_line(aes(group=Male),colour="#431600",alpha=0.1)+ opts(title = "Male Baby Name Popularity Since 1950")
p

image

Comparing the relative fortunes of these names is informative but the chart has two main shortcomings. Charting all ~2500 names gives an interesting pattern and texture but it doesn’t obliviously add to the understanding of the underlying data. At best it is an interesting background, at worst it distracting chart junk. While I don’t want to sound like a fanboy Tufte Evangelist but this borders on chartjunk. We can print it with the background nonsense.

nm<-c("Malcolm","Ethan","John","Eric")
p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank))
p <- p + geom_line(data = names[which(names$Male %in% nm),], aes(group=Male, colour = Male), alpha = 1, size = 1)
p <- p + opts(title = "Male Baby Name Popularity Since 1950")
p

image

This is clearer but there is still a legend lookup issue. It’s unclear which names go with which line. ggplot’s facet_wrap is great for this.

p <- p + facet_wrap(~Male)
p

image

It is clear that the names with greatest range of ranks are a more interesting subjects. Even though I think that the regal name John is fine subject of study, continued dominance in the name game can get tiresome. Looking into the names with the great rise or fall an interesting exercise. That will be part 2 of this rambling visual diarrhea.

Nov 20, 20112 notes
#R #python #ggplot
Pretty R Test

p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank)) + geom_line(data = names[which(names$Female == nm),], aes(group=Female, colour = Female), alpha = 1, size = 2)+geom_line(aes(group=Female),colour="#431600",alpha=0.1)+ opts(title = 'Popularity of the Female Name')+ opts(axis.text.x=theme_text(angle=-45),hjust=0)
p

Created by Pretty R at inside-R.org

Nov 20, 2011
Nov 18, 2011
Next page →
2011 2012
  • January 1
  • February
  • March
  • April
  • May
  • June 1
  • July
  • August 1
  • September 1
  • October 3
  • November
  • December
2011 2012
  • January
  • February
  • March
  • April
  • May
  • June
  • July
  • August
  • September
  • October
  • November 5
  • December 5