jeudi 16 août 2018

R - Compare and test primary data frame on multiple conditions with secondary data frame and if true assign a given value from secondary data frame

I have two data frames which the following code will provide an example of the data that I have:

df <- data.frame("ID" = c(325, 464, 464, 464, 512, 512, 687, 701, 869), 
             "DATE" = c("2012-11-05", "2014-04-04", "2014-04-05", "2014-04-06", 
                        "2014-10-13", "2014-10-14", "2014-12-12", "2015-02-17", "2015-06-25"))

df2 <- data.frame("ID" = c(325, 464, 512, 701, 869, 922, 954, 989),
              "DATE_1" = c("2012-11-03", "2014-04-01", "2014-10-10", "2015-02-14",
                           "2015-06-20", "2015-07-07", "2015-09-11", "2015-11-23"),

              "DATE_2" = c("2012-11-08", "2014-04-10", "2014-10-19", "2015-02-22",
                           "2015-06-29", "2015-07-13", "2015-09-25", "2015-11-29"),

              "INTERVAL" = c("2012-11-03--2012-08-11 UTC", "2014-04-01--2014-04-10 UTC",
                             "2014-10-10--2014-10-19 UTC", "2015-02-14--2014-02-22 UTC",
                             "2015-06-20--2015-06-29 UTC", "2015-07-07--2015-07-13 UTC",
                             "2015-09-11--2015-09-25 UTC", "2015-11-23--2015-11-29 UTC"),
              "KEY" = c(6, 8, 233, 642, 1233, 2464, 3436, 4366))

Which should yield:

df
   ID     DATE
1  325 2012-11-05
2  464 2014-04-04
3  464 2014-04-05
4  464 2014-04-06
5  512 2014-10-13
6  512 2014-10-14
7  687 2014-12-12
8  701 2015-02-17
9  869 2015-06-25

df2
   ID     DATE_1    DATE_2   INTERVAL                    KEY
1  325 2012-11-03 2012-11-08 2012-11-03--2012-08-11 UTC     6
2  464 2014-04-01 2014-04-10 2014-04-01--2014-04-10 UTC     8
3  512 2014-10-10 2014-10-19 2014-10-10--2014-10-19 UTC   233
4  701 2015-02-14 2015-02-22 2015-02-14--2014-02-22 UTC   642
5  869 2015-06-20 2015-06-29 2015-06-20--2015-06-29 UTC  1233
6  922 2015-07-07 2015-07-13 2015-07-07--2015-07-13 UTC  2464
7  954 2015-09-11 2015-09-25 2015-09-11--2015-09-25 UTC  3436
8  989 2015-11-23 2015-11-29 2015-11-23--2015-11-29 UTC  4366

The interval column was created with lubridate's interval function with "DATE_1" and "DATE_2" as arguments embedded in dplyr's mutate function.

My problem is that I need to test the rows in my primary data.frame (df) if the ID matches the ID in my secondary data.frame (df2) and if the date in df falls within the interval in df2.

I would like R to create a new column in df called "NEW_KEY" and if a row in df tests true under both conditions then R should take the value in the KEY column in df2 from the row in which fit on ID and interval and insert it into the NEW_KEY column in df. If the row tests false under both conditions then R should insert NA in the row under the NEW_KEY column.

Do mind that df contains multiple duplicates which all should tests true under both conditions and should have the same KEY value filled in each row and df also contains ID which has no corresponding row in df2 and should have its NEW_KEY filled with NA.

I have attempted the following two simplistic code

df$NEW_KEY <- ifelse(ifelse(df$ID %in% df2$ID & df$DATE %within% df2$INTERVAL, df2$KEY, NA)

and

df <- mutate(df, NEW_VALUE = ifelse(df$ID %in% df2$ID & df$DATE %within% df2$INTERVAL, df2$KEY, NA))

both which yield the same error message:

1: I as.numeric(a) - as.numeric(int@start) : longer object length is not a multiple of shorter object length
2: I as.numeric(a) - as.numeric(int@start) <= int@.Data : longer object length is not a multiple of shorter object length
3: I as.numeric(a) - as.numeric(int@start) : longer object length is not a multiple of shorter object length

and only seems to fill out some rows in df and NA the rest.

So I'm currently at a halt on what to do. Any help would be greatly appreciated.

Aucun commentaire:

Enregistrer un commentaire