Command-Line Worldview - John Dennison

RSS

vim regex printing repeating patterns to build update statements

So in the process of have to write hairy etl scripts I found myself have to write large update statements setting an old record with every field from a new record. This lead to long cookie cutter sql where you end up slitting your own wrists and that make nobody happy. I share this with the hopes that it might hlp someone and more importantly so i remember.

a.field1 = b.field1,

a.field2 = b.field2,

The trick i found was to use string replacements in VI. first you get the ddl/description of the table you are updating 

birthdate              | date                  | 
startdate          | date                  | 
enddate          | date                  | 
field1   | character varying(15) | 
field2 | character varying(15) | 

use :%s/[ ]*|.*//g

to get the field name

birthdate

startdate

enddate

field1

field2

next use :%s/^\(.*\)$/a.\1 = b.\1,/g 

to get:

a.birthdate = b.birthdate,
a.startdate = b.startdate,
a.enddate = b.enddate,
a.field1 = b.field1,
a.field2 = b.field2,
Boom. Now you have your update. 

Cut and Paste Images into GWT/Chrome

Beyond the statistical computing work that keeps me busy most days, I have spent the last year dipping my toes into web development. Having some experience with python, when the opportunity came up to rewrite our horrible sharepoint site, I jumped on it. Before any one could pipe up we had a django site that has grown to incorporate our entire intranet.

However, our main external facing app remains a GWT/EXT stack and this has proven to be a much steeper hill to climb. For those who do not know, GWT is beautiful piece of engineering, that translates complex applications built in Java into optimized javascript that runs on any browser (IE6+). If that sounds like a terrible idea to you, you’re not alone. If you want to build web applications, use a web language. However, I will not wade into that thicket now. Despite some’s philosophical apprehensions, if you need to build full blown web apps on IE6+ GWT has proven to be a powerful framework.

A feature request came up where we needed to have users upload screen shots to our servers and retrieve them at a future date. Having users take a screen shot, save to file and upload through a simple file upload to server java servlet is a simple, straight forward approach. However, us nerds fetishize complexity. While using gmail I stumbled upon the in-email cut/paste of images directly in the browser. While most users probably didn’t even notice this feature if you write stuff for a living you ogle in jealousy. I GOTS TO GET ME SOME OF THAT.

GWT has no built in ability to handle image pasting from the clip board. However, one of the beautiful mottos of GWT is “when you can not, you JSNI.” Because all your client Java code is mangled into javascript, GWT allows you to write javascript methods that can be called from java code. Here are a couple of snippets i came up with to implement image pasting in GWT. Note this relies on Chrome’s file reader api so it is NOT a cross browser solution and certainly will not work in IE anything. So enough rambling, here is the code.

The core of the approach consists of two functions. The first implements the paste listener: 

This is an example of a JSNI method. The special /*-{ notation tricks Eclipses into thinking its a comment. One more special note are the $wnd and $doc variables. These are special GWT global variables that allow you to access the window and document objects that all javascript. for example if you had jQuery loaded in the app you would have to use $wnd.$(‘#idstring’)

The javascript is a horrible tangled mess of closures because the FileReader api is async. You define the .onloadend method that is executed when you call the  reader.readAsDataURL(blob) method. This honestly was the most complicated aspect of the feature. async javascript makes my mind hurt. Just make sure you execute the method when you initialize your page.

Inside of the onloadend closure we make a call to a java function that handles the placement of the picture on the page. I have the image widget already placed on my page. Here we are creating a brand new one:

In the reader.onloadend function you have the base64 encoding, which is the ancii representation of the png file you passed to the java method onPaste. Here are the relevant docs:

http://www.youtube.com/watch?v=3vAnuBtyEYE

This a clever solution which allows you to embed images directly into html. To get the image data all you need to do is call the Image.getURL() method. Be warned when storing this representation back on the server base64 is usually 30% larger then the binary equivalent. We ended up shipping the base64 string to the server through RPC calls but decoding it and storing it as a bytea field in postgres. From there we built a custom servlet that reads the picture out of the database and ships it to the web page. Although the pasting can only occur in Chrome, the image can be read from any browser. Yes even IE6,7 (which cannot render base64).

One more note, be sure to crop off the leading encoding string in the image URL. It will look like: <img src=” /> make sure you drop the "data:image/png;base64," part before you store as a binary file. 

Hope this helps. Cheers

Code Test

I am trying out a new code highlighter:


puts "Hello"
puts "World"

Simple Parallel randomForest with doMC package

I have been exploring how to speed up some of my R scripts and have started reading about some amazing corners of R. My first weapon was the Rcpp and RcppArmadillo package. These are wonderful tools and even for someone that has never written c++ before, there are enough to examples and documentation to get started. I hope to share some of these approaches soon. The second and the topic of this missive is parallelization through the package doMC.

When answering a question on stats.stackoverflow, I got to share a simple use case with the randomForest library. I wanted to share that here as well. First a little background on why this works. randomForest is an ensemble technique that trains a group of decision trees on random subset of a training set( a group of trees, forest get it). A new record is presented to this group and what ever class the preponderance of underlying decision trees choose is the classification. For this example, one of the most important aspects of this algorithm is that each decision tree is independently trained on a random subset of variables and records. When in the parallel mindset independent is next godliness, what it means is that you can spread the training of a forest over multiple cores/machines. Below is a simple one machine example.

library("doMC")
library("randomForest")
data(iris)
 
registerDoMC(4) #number of cores on the machine
darkAndScaryForest <- foreach(y=seq(10), .combine=combine ) %dopar% {
   set.seed(y) # not really needed
   rf <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
}

I passed the randomForest combine function(which will stitch independently trained forests together)  to the similarly named .combine parameter( which controls the function on the output of the loop. The down side is you get no OOB error rate or more tragically variable importance. But this will speed up your training. 

Re-approaching problems with parallelization in mind is a tough problem but i have seen some very real improvements in my speed by doing so. 

Sampling in Postgres w/ random()

Here is a simple trick for sampling in Postgres. I seem to find a new corners of Postgres every day.

say you have a list of accounts/users and you want a random subset:

select account_number

from customers

where random() > .1

limit 10000;

simple but effective.

R/Python Web Apps

I have a little delinquent on this whole blogging thing but here is a talk I gave to the DC R Group. On a twisted and Rpy2 web application framework that I built for my company. Enjoy

http://bit.ly/NW0Neg

J

Jan 4

Long running R commands: unix screen, nohup and R

I wanted to write a quick post about a useful linux tool for using R. I sometimes have long running R sessions for model training. One solution that I have alluded to is to call your script with Rscript.

The nohup command in linux with push this to the back ground. Even adding mutt(or other email program) to the bottom of your script is extremely useful.

You can save the details of the training into output.txt and then place this command in your R file

system(mutt -s “Master, this is an email from HAL 9000, your commands have been completed” john@foo.com < ~/R/output.txt)

Then you can call the script with nohup Rscript foo.R &

The nohup command runs a command in the background and the ampersand is unix for return the prompt. This will redirect all standard output to a file nohup.out file placed in the directory you called it. However, if you have multiple R file running and one breaks the error gets lost in the common log. Total mess, trust me. The solution is to redirect the output log to individual files with this little thing:

nohup Rscript foo.R > ~/R/logs/foo.out 2>&1 &

This redirects the standard out to a file ~/R/logs/foo.out as well as place all standard error there as well. that is what: 2>&1 is doing.

This is all well and good but we know that most R is written interactively not in nifty self contained scripts. To do this linux has a great program call screen. I and not going to get too fancy with screen but a it has been extremely useful to me. Again i am sshing to remote reserved servers or EC2 instances but this is also useful on a home machine for kicking of an R command and going back to what ever you do on you machine. (ANSII porn or http://xkcd.com/981/). The command is:

screen -S R

This will open a seemingly identical terminal prompt. From here use R(type R…) as you would. When you hit a long running process. Hit Ctrl-a d. This is the screen command to access all screen commands. Ctrl-a d will bring you back to your home prompt. Ctrl-a then typing :quit will exit the screen. When you want to get back, you can type:

screen

Now the -S R options you typed before are optional and if you only have one screen open. I have many screens running (mainly interactive sessions to my databases or ssh to EC2 machines) and getting back to them is a pain unless you name the screens the -S FOO will open a screen called FOO. To reopen type:

screen -Dr FOO

or in our case:

screen -Dr R

The D arg will detach the screen if you exited in a less then graceful manner.

One more trick: if you paste the R train command and the above system command email system() as one text block in the R screen session, then detach you will receive a lovely email when you are done.

Good Luck hacking this the New Year!!!

John

Lehmann Primiality Test in R

After a small frustration with the limitations of R 16-bit intergers. I think i have worked out a solution. The issue was i was implementing a bone-headed solution to the modulo/exponent step. So to recap. The Lehmann primality test is a probabilistic prime test. I got the implementation from this article http://davidkendal.net/articles/2011/12/lehmann-primality-test

the basic logic is to take n

1. get a random number a = between 1 and n-1

2. calculate x =(a^(n-1)/2) %% n

3. if x == 0 or x == (-1 %% n) then the number could be prime

The issue happened when calculating x =(a^(n-1)/2) %% n. For even n < 100 the number in this function gets big. So big that the limitation on R integer lengths rounds the number and then the modulo function is meaningless. The way to fix this is to use modular exponentiation. (Thanks to stackoverflow  for bringing up the solution). This allows us to iteratively multiply the  base by the exponent and store the modulo for every step. By doing this we never have to deal with a number so large as to screw with the R limitations.There are some clever highly efficient modular exponentiation algorithms but i am no computer scientist so we will use the simplest(modular exponentiation with modular multiplication)

Here is the full solution:

modexp<-function(a, b, n){
    r = 1
    for (i in 1:b){
        r = (r*a) %% n
    }
    return(r)
}
 
 
primeTest <- function(n, iter){
   a <- sample(1:(n-1), 1)
    lehmannTest <- function(y, tries){
      x <- modexp(y, (n-1)/2, n)   
    if (tries == 0) {
      return(TRUE)
            }else{          
      if ((x == 1) | (x == (-1 %% n))){
        lehmannTest(sample(1:(n-1), 1), (tries-1))
        }else{
        return(FALSE)
      }
    }
  }
   if( n < 2 ){
     return(FALSE)
     }else if (n ==2) {
       return(TRUE)
       } else{
         lehmannTest(a, iter)
         }
}

This type of iterative exponentiation is implemented in Python under pow() so a python implementation is quite easy to build as well. Enjoy. R can get frustrating but a little searching around can get you a far way.

Cheers

J

I may have ventured into the first circle and Virgil Is NOT Here — Lehmann Primality Test in R

So a post about Lehmann Primality Tests in HackerNews came across my twitter this morning. Seemed like a great quick coding exercise during lunch. Little Did I know…It is wonderfully simple primality test but i think i have hit up against some R quirks. Here is my code for the test (thanks to David Kenal at the aforementioned link of the implementation in Scheme, this is very much his logic):

the basic logic is to take n

1. get a random number a = between 1 and n-1

2. calculate x =(a^(n-1)/2) %% n

3. if x == 0 or x == (-1 %% n) then the number could be prime

The more you repeat these three steps the more certain you are. So you call it recursively. So the function takes to arguments the number is question and the number of iterations(higher the more certain). This implementation seemed to work.

However… I started testing larger primes and things go terribly wrong. you get incorrect an answers and an ominous sounding message:

Warning message:
probable complete loss of accuracy in modulus 

I believe the issue comes down to floating point conversions. AKA the first circle of the R inferno. (which is a great read and for a dante fan and a nerd a wonderful collisions of worlds)

say for n<-97

> (y^((n-1)/2))
[1] 2.730792e+64

That is a big Number thanking the mod of it doesn’t go so well. above a certain point the mod function breaks down because of floating point rounding. I don’t know if there is an easy solution but i wanted to share my experiences as and if you are “Midway on our life’s journey, I found myself in dark woods, the right road lost” always look to R inferno for help.

Here is a loop to test:

prime_test<-c(2,3,5,7,11,13,17 ,19,23,29,31,37)

for (i in 1:length(prime_test)) {
  print(primeTest(prime_test[i], 50))
}

Poor Poor Hillary

Poor Poor Hillary