R Chronicle: February 2013

This is a small improvement to the bigcor function proposed on Rmazing for computing huge correlation matrices in R: I made the function run in parallel, using all the CPU cores available on the machine. The code is here.
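
As a rough illustration of the idea (not the linked code itself), the sketch below splits the columns into blocks and computes the correlation of each pair of blocks as a separate task with mclapply from the base parallel package. The name block_cor_par and its arguments are hypothetical, and it stores the result in a plain in-memory matrix rather than an ff object, so it only suits matrices that fit in RAM.

```r
library(parallel)

# Hypothetical helper (not the author's bigcorPar): correlate the columns of
# `x` block by block, one pair of blocks per task.
block_cor_par <- function(x, nblocks = 4, ncores = 2) {
  p <- ncol(x)
  # Contiguous groups of column indices; there are at most nblocks of them
  groups <- split(seq_len(p), ceiling(seq_len(p) * nblocks / p))
  nb <- length(groups)
  # Block pairs on or above the diagonal; the rest follows by symmetry
  pairs <- which(upper.tri(matrix(0, nb, nb), diag = TRUE), arr.ind = TRUE)
  # mclapply forks, which Windows does not support, so fall back to one core
  if (.Platform$OS.type == "windows") ncores <- 1L
  pieces <- mclapply(seq_len(nrow(pairs)), function(k) {
    cor(x[, groups[[pairs[k, 1]]], drop = FALSE],
        x[, groups[[pairs[k, 2]]], drop = FALSE])
  }, mc.cores = ncores)
  # Assemble the blocks into the full p x p matrix
  out <- matrix(NA_real_, p, p)
  for (k in seq_len(nrow(pairs))) {
    gi <- groups[[pairs[k, 1]]]
    gj <- groups[[pairs[k, 2]]]
    out[gi, gj] <- pieces[[k]]
    out[gj, gi] <- t(pieces[[k]])
  }
  out
}

# Small check against a direct call to cor()
set.seed(1)
MAT <- matrix(rnorm(200 * 10), nrow = 10)
res <- block_cor_par(MAT, nblocks = 4)
same <- isTRUE(all.equal(res, cor(MAT), check.attributes = FALSE))
```

Each task only touches two groups of columns, so the tasks are independent and can be handed to different cores without any coordination beyond collecting the pieces at the end.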

Here is a benchmark of the two functions on my machine, which has 8 cores:

	R <- c(2000, 5000, 10000, 20000, 40000)
	## I hit the limit at ~50000: the ff function refuses to create the matrix.
	# Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
	#   missing value where TRUE/FALSE needed
	# See http://www.bytemining.com/2010/05/hitting-the-big-data-ceiling-in-r/
	normal <- numeric(length=length(R))
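	## Time the sequential bigcor on a random 10 x R[i] matrix
	for (i in seq_along(R)) {
		## Use more blocks for the larger matrices
		split <- if (R[i] <= 20000) 10 else 20
		MAT <- matrix(rnorm(R[i] * 10), nrow = 10)
		normal[i] <- system.time(res <- bigcor(MAT, nblocks = split, verbose = FALSE))["elapsed"]
	}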

	parallel <- numeric(length=length(R))
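	## Same benchmark with the parallel version, bigcorPar
	for (i in seq_along(R)) {
		## Use more blocks for the larger matrices
		split <- if (R[i] <= 20000) 10 else 20
		MAT <- matrix(rnorm(R[i] * 10), nrow = 10)
		parallel[i] <- system.time(res <- bigcorPar(MAT, nblocks = split, verbose = FALSE))["elapsed"]
	}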

	## Long format: one row per (function, matrix size) timing
	d <- data.frame(time = c(normal, parallel),
		type = rep(c("normal", "parallel"), each = length(R)),
		size = rep(R, 2))

	library(ggplot2)
	pdf("bigcor_benchmark.pdf", height=7, width=7)
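	## print() makes sure the plot reaches the PDF even when the script is sourced
	print(qplot(size, time, data = d, group = type, colour = type, geom = c("point", "path"),
		xlab = "Matrix size", ylab = "Time in sec.",
		main = "Speed comparison bigcor / bigcorPar"))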
	dev.off()
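
The ceiling at ~50000 mentioned in the comments is consistent with simple arithmetic: a 50000 x 50000 matrix has more cells than .Machine$integer.max. The snippet below is only a sketch of that arithmetic. It assumes the ff length is limited to a 32-bit integer, as the error message suggests, and it does not call ff itself.

```r
# Does an n x n matrix fit in a length limited to .Machine$integer.max?
n <- c(40000, 50000)
fits <- n^2 <= .Machine$integer.max

# Largest n for which n^2 still fits
max_n <- floor(sqrt(.Machine$integer.max))

# Integer multiplication overflows to NA (with a warning), which would explain
# the "missing value where TRUE/FALSE needed" error if the length is computed
# with integers (an assumption about ff internals)
overflow <- suppressWarnings(50000L * 50000L)

print(fits)
print(max_n)
print(overflow)
```

Under that assumption, the largest square correlation matrix that can be allocated has a little over 46000 columns, which matches the benchmark stopping at 40000.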



Sunday, February 24, 2013

Large correlation in parallel