Reproducible example:
library(caTools) #for sample.split function
set.seed(123)
#Creating example data frame
example_df <- data.frame(personID = stringi::stri_rand_strings(1000, 5),
                         sex = sample(1:2, 1000, replace=TRUE),
                         age = round(rnorm(1000, mean=50, sd=15), 0))
#Example of random splitting:
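#sample.split returns a logical mask (TRUE = training row); keep it so the test set is the exact complement
in_training <- sample.split(example_df$personID)
training_set <- example_df[in_training,]
test_set <- example_df[!in_training,]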
#evaluation of variables in test and training data sets:
#Should be close to 1 (here it is about 1.2, which is too high)
(sum(training_set$sex == 1) / sum(training_set$sex == 2)) / (sum(test_set$sex == 1) / sum(test_set$sex == 2))
[1] 1.219139
#Should be close to 1 across the whole distribution (it is quite good, this is what I would expect)
summary(training_set$age) / summary(test_set$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.7143  0.9756  1.0000  1.0032  1.0169  1.0000
Although sample.split divided age appropriately (the distributions match), the proportions of males and females in the sex variable differ noticeably between the two sets. Which function can I use to split data automatically and evenly into multiple sets (two in this example) while preserving the proportions and distributions of the variables?
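One possible direction, offered as a minimal base-R sketch and not as a definitive answer, is a stratified split: group the rows by sex crossed with age quartile, then sample the same fraction within each group. The names age_bin, strata and train_idx are made up for illustration, the 2/3 training fraction and the quartile binning are assumptions, and the IDs are generated with sprintf only to keep the sketch free of extra packages.

```r
set.seed(123)

# Data with the same shape as the example above (IDs simplified, an assumption)
example_df <- data.frame(
  personID = sprintf("id%04d", 1:1000),
  sex = sample(1:2, 1000, replace = TRUE),
  age = round(rnorm(1000, mean = 50, sd = 15), 0)
)

# Strata: sex crossed with age quartile (the binning choice is an assumption)
age_bin <- cut(example_df$age,
               breaks = quantile(example_df$age, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)
strata <- interaction(example_df$sex, age_bin, drop = TRUE)

# Within every stratum, draw about 2/3 of the row indices for training
rows_by_stratum <- split(seq_len(nrow(example_df)), strata)
train_idx <- unlist(lapply(rows_by_stratum, function(idx) {
  idx[sample.int(length(idx), round(length(idx) * 2 / 3))]
}), use.names = FALSE)

training_set <- example_df[train_idx, ]
test_set <- example_df[-train_idx, ]

# Same check as above: ratio of the male/female ratios in the two sets
sex_ratio_check <- (sum(training_set$sex == 1) / sum(training_set$sex == 2)) /
  (sum(test_set$sex == 1) / sum(test_set$sex == 2))
```

Because every stratum contributes the same fraction of its rows, the sex ratio and the coarse age distribution should come out similar in both sets. Finer age bins tighten the age match but leave fewer rows per stratum. If only the sex ratio matters, passing example_df$sex (instead of the unique personID) as the label vector to sample.split may be enough, since that function balances the labels it is given.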