lundi 10 février 2020

R: t-test for each pairwise combination of a grouping variable, done for every element in an ID variable

Following up on this question, I am trying to add one more layer of difficulty.

I have a data.frame that looks like this:

> set.seed(123)
> mydf <- data.frame(Marker=rep(c('M1','M2'),each=15),
+                    Patient=rep(rep(c('P1','P2','P3'),each=5),2),
+                    Value=sample(1:1000, 30, replace = F))
> mydf
   Marker Patient Value
1      M1      P1   288
2      M1      P1   788
3      M1      P1   409
4      M1      P1   881
5      M1      P1   937
6      M1      P2    46
7      M1      P2   525
8      M1      P2   887
9      M1      P2   548
10     M1      P2   453
11     M1      P3   948
12     M1      P3   449
13     M1      P3   670
14     M1      P3   566
15     M1      P3   102
16     M2      P1   993
17     M2      P1   243
18     M2      P1    42
19     M2      P1   323
20     M2      P1   996
21     M2      P2   872
22     M2      P2   679
23     M2      P2   627
24     M2      P2   972
25     M2      P2   640
26     M2      P3   691
27     M2      P3   530
28     M2      P3   579
29     M2      P3   282
30     M2      P3   143

What I want to do, is to run t.test for each Patient combination (my grouping variable), on a Marker basis (my ID variable).

Based on one answer to the related question above, I know how to do it for one Marker at a time.

I can subset mydf and do the following:

> params_list <- utils::combn(levels(mydf$Patient), 2, FUN = list)
> mydf0 <- subset(mydf, Marker=="M1")
> model_t <- purrr::map(.x = params_list, 
+                       .f = ~ t.test(formula = Value ~ Patient, 
+                       data = subset(mydf0, Patient %in% .x)))
> t_pvals <- purrr::map_dbl(.x = model_t, .f  = "p.value")
> names(t_pvals) <- purrr::map_chr(.x = params_list, .f = ~ paste0(.x, collapse = "-vs-"))
> t_pvals
 P1-vs-P2  P1-vs-P3  P2-vs-P3 
0.3945742 0.5678729 0.7820905

Now I want to do it for all Markers in mydf in an elegant way, and I chose data.table.

I try the following, but I cannot reproduce the above pvalue results for Marker M1.

> group1 <- unlist(lapply(params_list, '[', 1))
> group2 <- unlist(lapply(params_list, '[', 2))
> mydt <- data.table::data.table(mydf)
> results_df <- as.data.frame(mydt[, list(group1= unlist(lapply(params_list, '[', 1)),
+                                         group2= unlist(lapply(params_list, '[', 2)),
+                                         pvalue= purrr::map_dbl(.x = purrr::map(.x = params_list,
+                                                 .f = ~ stats::t.test(formula = Value ~ Patient, paired=FALSE,
+                                                 data = subset(mydt, Patient %in% .x))), .f  = "p.value") ),
+                                  by=list(Marker=Marker)])
> results_df
  Marker group1 group2    pvalue
1     M1     P1     P2 0.8092365
2     M1     P1     P3 0.5156313
3     M1     P2     P3 0.2879954
4     M2     P1     P2 0.8092365
5     M2     P1     P3 0.5156313
6     M2     P2     P3 0.2879954

The structure of results_df is exactly as I want it, but the pvalues are clearly wrong. They are not the same as the ones in the test above for M1, and they are identical for M1 and M2, meaning the same data subset is used in both cases.

I figured I should subset for each Marker as well in the subset command, so I did this instead:

> markers_list <- as.list(levels(mydf$Marker))
> mydt <- data.table::data.table(mydf)
> results_df <- as.data.frame(mydt[, list(group1= unlist(lapply(params_list, '[', 1)),
+                                         group2= unlist(lapply(params_list, '[', 2)),
+                                         pvalue= purrr::map_dbl(.x = purrr::map(.x = params_list, .y = markers_list,
+                                                 .f = ~ stats::t.test(formula = Value ~ Patient, paired=FALSE,
+                                                 data = subset(mydt, Patient %in% .x & Marker==.y))), .f  = "p.value") ),
+                                  by=list(Marker=Marker)])
> results_df
  Marker group1 group2    pvalue
1     M1     P1     P2 0.7337355
2     M1     P1     P3 0.6930669
3     M1     P2     P3 0.3788015
4     M2     P1     P2 0.7337355
5     M2     P1     P3 0.6930669
6     M2     P2     P3 0.3788015

I thought this would be it, but still I'm getting incorrect pvalues, and identical for both M1 and M2 (same data subset still being used for both)...

So now I'm clueless... What am I doing wrong here? What would be the way to do it?

Thanks!

Aucun commentaire:

Enregistrer un commentaire