Thursday, May 2, 2019

Speed up a k-samples Anderson-Darling test in R with large data sets in a parallel process

I am running a series of tests to compare pairs of very large samples (more than 1 million entries each).

To make the computation more efficient, I run all the tests inside a parallel loop of the following form:

library(foreach)
library(doParallel)

cl <- makeCluster(7) 
registerDoParallel(cl)

output <- foreach(j = seq_along(regions), .combine = rbind) %dopar% {
  library(dplyr) 

Side note: I have to load the libraries in every iteration because clusterCall (see this) is not working, for a reason I have not managed to identify.

  library(kSamples)
  # Values of each sample for the current region
  sample1_region <- sample1 %>%
    filter(region == regions[j])
  sample1_region <- sample1_region$value

  sample2_region <- sample2 %>%
    filter(region == regions[j])
  sample2_region <- sample2_region$value

The two samples have about 30 million entries across about 30 regions. I want to compare the samples by computing a t-test, a Kolmogorov–Smirnov test, and an Anderson-Darling test:

  # Welch t-test on the full samples; KS and AD tests on the distinct values only
  t_test <- t.test(sample2_region, sample1_region)
  ks_test <- ks.test(unique(sample2_region), unique(sample1_region))
  ad_test <- ad.test(unique(sample2_region), unique(sample1_region))
  # One summary row per region: estimate[[1]] is the sample2 mean, estimate[[2]] the sample1 mean
  output_temp <- data.frame("region" = regions[j]
                            , "sample1_mean" = t_test$estimate[[2]]
                            , "sample2_mean" = t_test$estimate[[1]]
                            , "diff" = t_test$estimate[[1]] - t_test$estimate[[2]]
                            , "t.value" = sprintf("%.3f", t_test$statistic)
                            , "df" = t_test$parameter
                            , "t_p.value" = as.numeric(sprintf("%.3f", t_test$p.value))
                            , "ks_p.value" = as.numeric(sprintf("%.3f", ks_test$p.value))
                            # ad[1, 3]: row 1 (version 1), column 3 (asymptotic p-value)
                            , "ad_p.value" = as.numeric(sprintf("%.3f", ad_test$ad[1, 3]))
                            , stringsAsFactors = FALSE)
  output_temp
}
stopCluster(cl)
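
Regarding the side note about loading the libraries in every iteration: foreach has a .packages argument that attaches the listed packages on each worker before the loop body runs, which may remove the need for the library() calls inside the loop. Below is a minimal sketch of that variant. The toy data, the two-worker cluster, and the reduced output (AD p-value only) are assumptions made to keep the example small; they are not the original data or the full script.

```r
library(foreach)
library(doParallel)

# Toy stand-ins for the real data. The column names `region` and `value`
# follow the question; the values themselves are made up.
set.seed(1)
sample1 <- data.frame(region = rep(c("a", "b"), each = 200),
                      value = rnorm(400))
sample2 <- data.frame(region = rep(c("a", "b"), each = 200),
                      value = rnorm(400, mean = 0.3))
regions <- unique(sample1$region)

cl <- makeCluster(2)  # assumption: 2 workers are enough for the toy data
registerDoParallel(cl)

# .packages attaches dplyr and kSamples on every worker,
# so the loop body contains no library() calls.
output <- foreach(j = seq_along(regions), .combine = rbind,
                  .packages = c("dplyr", "kSamples")) %dopar% {
  x <- filter(sample1, region == regions[j])$value
  y <- filter(sample2, region == regions[j])$value
  ad <- ad.test(y, x)
  data.frame(region = regions[j],
             ad_p.value = ad$ad[1, 3],
             stringsAsFactors = FALSE)
}

stopCluster(cl)
print(output)
```

The result has one row per region, as in the original loop. Whether this also sidesteps the clusterCall problem mentioned above is not something the sketch can confirm.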

The script works fine, but kSamples::ad.test (I could not find a k-samples implementation of this test in other packages) takes forever to run (more than 10 minutes per iteration), while the other two tests take only a few seconds. Is there any way I can speed up, or simplify, this process?
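
One direction that could be tried, offered as an untested sketch and not as a confirmed fix: run the Anderson-Darling test on a random subsample of each region, since the running time depends on how many values are passed in. The helper ad_test_subsampled and its n_max cap of 10,000 are hypothetical names and values chosen for illustration; they are not part of kSamples. The data below is made up.

```r
library(kSamples)

# Hypothetical helper (not part of kSamples): run the k-sample
# Anderson-Darling test on random subsamples of at most n_max values each.
ad_test_subsampled <- function(x, y, n_max = 10000) {
  xs <- if (length(x) > n_max) sample(x, n_max) else x
  ys <- if (length(y) > n_max) sample(y, n_max) else y
  ad.test(xs, ys)
}

set.seed(42)
x <- rnorm(1e6)               # made-up data standing in for one region
y <- rnorm(1e6, mean = 0.05)

res <- ad_test_subsampled(x, y)
p_value <- res$ad[1, 3]       # same cell the original script reads
print(p_value)
```

A test on a subsample has less power than one on the full data, so its p-values are not interchangeable with full-sample results, and they vary with the random seed.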
