I am running a series of tests to compare pairs of very large samples (>1 million entries each).
To make the computation more efficient, I nested all the tests in a parallel loop of the following type:
library(foreach)
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
output <- foreach(j = seq_along(regions), .combine = rbind) %dopar% {
library(dplyr)
library(kSamples)
# Side note: I have to load the libraries at every iteration, since clusterCall
# (see this) is not working for a reason I did not manage to identify.
sample1_region <- sample1 %>%
  filter(region == regions[j]) %>%
  pull(value)
sample2_region <- sample2 %>%
  filter(region == regions[j]) %>%
  pull(value)
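Filtering the full data frame on every iteration re-scans all ~30 million rows each time; splitting each sample's value column by region once, before the loop, reduces the per-iteration work to a list lookup. A minimal sketch with toy data (the column names region/value match the code above; the values themselves are illustrative only):

```r
# Split each sample's values by region once, outside the parallel loop.
# Toy data frame standing in for the real `sample1`:
sample1 <- data.frame(region = rep(c("A", "B"), each = 3),
                      value  = c(1.2, 0.7, 1.9, 2.4, 2.1, 2.8))
sample1_by_region <- split(sample1$value, sample1$region)

# Inside the loop, the dplyr filter is then replaced by a cheap lookup:
sample1_region <- sample1_by_region[["A"]]
sample1_region  # 1.2 0.7 1.9
```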
The two samples have about 30 million entries across about 30 regions. I want to compare the samples with a t-test, a Kolmogorov–Smirnov test, and an Anderson–Darling test:
t_test<-t.test(sample2_region, sample1_region)
ks_test<-ks.test(unique(sample2_region), unique(sample1_region))
ad_test<-ad.test(unique(sample2_region), unique(sample1_region))
output_temp <- data.frame("region"=regions[j]
, "sample1_mean" = t_test$estimate[[2]]
, "sample2_mean" = t_test$estimate[[1]]
, "diff" = t_test$estimate[[1]]-t_test$estimate[[2]]
, "t.value" = sprintf("%.3f", t_test$statistic)
, "df"= t_test$parameter
, "t_p.value" = as.numeric(sprintf("%.3f", t_test$p.value))
, "ks_p.value" = as.numeric(sprintf("%.3f", ks_test$p.value))
, "ad_p.value" = as.numeric(sprintf("%.3f", ad_test$ad[1,3]))
, stringsAsFactors = FALSE)
output_temp
}
stopCluster(cl)
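As an aside on the library() calls inside the loop: foreach also takes a .packages argument that attaches the listed packages on every worker before the body runs, which might sidestep the clusterCall issue. A minimal self-contained sketch (it uses only stats here to stay runnable; the real script would pass c("dplyr", "kSamples")):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# .packages attaches the named packages on each worker before the body runs.
res <- foreach(j = 1:3, .combine = rbind, .packages = "stats") %dopar% {
  data.frame(j = j, p = t.test(rnorm(20), rnorm(20))$p.value)
}
stopCluster(cl)
res  # one row per iteration, with a p-value column
```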
The script works perfectly fine, but kSamples::ad.test
(I could not find the k-sample version of this test implemented in other packages) is taking forever to run (>10 minutes per iteration), while the other two tests take only a few seconds. Is there any way I can speed up, or simplify, this process?
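One workaround I have considered (a sketch only, and it assumes an approximate p-value is acceptable): since the cost of the Anderson–Darling statistic grows with sample size, run kSamples::ad.test on a random subsample of each region's unique values rather than all of them. The subsample size n_sub below is a tuning choice of mine, not something from the original script:

```r
library(kSamples)

set.seed(1)
# Stand-ins for one region's values; the real vectors have ~1 million entries.
sample1_region <- rnorm(50000)
sample2_region <- rnorm(50000, mean = 0.05)

n_sub <- 5000  # subsample size: trades accuracy for speed
s1 <- sample(unique(sample1_region), min(n_sub, length(unique(sample1_region))))
s2 <- sample(unique(sample2_region), min(n_sub, length(unique(sample2_region))))

ad_sub <- ad.test(s1, s2)
ad_sub$ad[1, 3]  # asymptotic p-value, indexed as in the script above
```

The resulting p-value only approximates the full-sample result, so this is a speed/accuracy trade-off rather than a drop-in replacement.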