r - Calculate number of distinct instances occurring in a given time period -
i have dummy data
structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7, 7), policy_num = c(41551662l, 50966414l, 43077202l, 46927463l, 57130236l, 57050065l, 26196559l, 33545119l, 52304024l, 73953064l, 50340507l, 50491162l, 76577511l, 108067534l), product = c("apple", "apple", "pear", "apple", "apple", "apple", "plum", "apple", "pear", "apple", "apple", "apple", "pear", "pear"), start_date = structure(c(13607, 15434, 14276, 15294, 15660, 15660, 10547, 15117, 15483, 16351, 15429, 15421, 16474, 17205), class = "date"), end_date = structure(c(15068, 16164, 17563, 15660, 15660, 16390, 13834, 16234, 17674, 17447, 15794, 15786, 17205, 17570), class = "date")), .names = c("id", "policy_num", "product", "start_date", "end_date"), row.names = c(na, -14l), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000320788>) id policy_num product start_date end_date 1 41551662 apple 2007-04-04 2011-04-04 1 50966414 apple 2012-04-04 2014-04-04 2 43077202 pear 2009-02-01 2018-02-01 3 46927463 apple 2011-11-16 2012-11-16 3 57130236 apple 2012-11-16 2012-11-16 3 57050065 apple 2012-11-16 2014-11-16 4 26196559 plum 1998-11-17 2007-11-17 5 33545119 apple 2011-05-23 2014-06-13 5 52304024 pear 2012-05-23 2018-05-23 5 73953064 apple 2014-10-08 2017-10-08 6 50340507 apple 2012-03-30 2013-03-30 7 50491162 apple 2012-03-22 2013-03-22 7 76577511 pear 2015-02-08 2017-02-08 7 108067534 pear 2017-02-08 2018-02-08
based on it, i'd calculate following variables (grouped user_id):
1) number of held product (no_prod_now
) - number of distinct products, end_date
> evaluated start_date
. simply, number of products held user_id
@ time of start_date
2) number of held active policies (no_policies_now
) - above, applied policy_num
3) number of policies opened within 3 months prior current start_date
(policies_open_3mo
)
4) policies_closed_3mo
- above, number of closed policies in past 3 months
the desirable output this:
id policy_num product start_date end_date no_prod_now no_policies_now policies_closed_3mo 1 41551662 apple 2007-04-04 2011-04-04 1 1 0 1 50966414 apple 2012-04-04 2014-04-04 1 1 0 2 43077202 pear 2009-02-01 2018-02-01 1 1 0 3 46927463 apple 2011-11-16 2012-11-16 1 1 0 3 57130236 apple 2012-11-16 2012-11-16 1 1 1 3 57050065 apple 2012-11-16 2014-11-16 1 1 2 4 26196559 plum 1998-11-17 2007-11-17 1 1 0 5 33545119 apple 2011-05-23 2014-06-13 1 1 0 5 52304024 pear 2012-05-23 2018-05-23 2 2 0 5 73953064 apple 2014-10-08 2017-10-08 2 2 0 6 50340507 apple 2012-03-30 2013-03-30 1 1 0 7 50491162 apple 2012-03-22 2013-03-22 1 1 0 7 76577511 pear 2015-02-08 2017-02-08 1 1 0 7 108067534 pear 2017-02-08 2018-02-08 1 1 1 policies_open_3mo 0 0 0 0 0 1 0 0 1 0 0 0 0 0
i'm looking solution implemented ideally in data.table
, i'm going apply big data volumes, base r
or dplyr
solutions convert data.table
, o valuable, thanks!
this quite tricky can solved number of non-equi self-joins.
edit: has turned out update on join doesn't work non-equi self-joins had expected (see here). so, had revise code avoid updates in place.
instead, 4 additional columns created 3 separate non-equi self-joins , combined final result.
library(data.table) library(lubridate) result <- # create helper column previous 3 months periods. # lubridate's month arithmetic avoids nas @ end of month, e.g., february dt[, start_date_3mo := start_date %m-% period(month = 3l)][ # start "cbind()" original columns , c(.sd, # count number of products , policies held @ time of start_date dt[dt, on = c("id", "start_date<=start_date", "end_date>start_date"), .(no_prod_now = uniquen(product), no_pols_now = uniquen(policy_num)), = .eachi][, c("no_prod_now", "no_pols_now")], # policies closed within previous 3 months of start_date dt[dt, on = c("id", "end_date>=start_date_3mo", "end_date<=start_date"), .(pols_closed_3mo = .n), = .eachi][, "pols_closed_3mo"], # additional policies opened within previous 3 months of start_date dt[dt, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"), .(pols_opened_3mo = .n - 1l), = .eachi][, "pols_opened_3mo"])][ # omit helper column , -"start_date_3mo"] result
id policy_num product start_date end_date no_prod_now no_pols_now pols_closed_3mo pols_opened_3mo 1: 1 41551662 apple 2007-04-04 2011-04-04 1 1 0 0 2: 1 50966414 apple 2012-04-04 2014-04-04 1 1 0 0 3: 2 43077202 pear 2009-02-01 2018-02-01 1 1 0 0 4: 3 46927463 apple 2011-11-16 2012-11-16 1 1 0 0 5: 3 57130236 apple 2012-11-16 2012-11-16 1 1 2 1 6: 3 57050065 apple 2012-11-16 2014-11-16 1 1 2 1 7: 4 26196559 plum 1998-11-17 2007-11-17 1 1 0 0 8: 5 33545119 apple 2011-05-23 2014-06-13 1 1 0 0 9: 5 52304024 pear 2012-05-23 2018-05-23 2 2 0 0 10: 5 73953064 apple 2014-10-08 2017-10-08 2 2 0 0 11: 6 50340507 apple 2012-03-30 2013-03-30 1 1 0 0 12: 7 50491162 apple 2012-03-22 2013-03-22 1 1 0 0 13: 7 76577511 pear 2015-02-08 2017-02-08 1 1 0 0 14: 7 108067534 pear 2017-02-08 2018-02-08 1 1 1 0
note there discrepancies policies opened within 3 previous months before start_date
between op's expected result , result here. id == 3
, there 2 policies starting both on 2012-11-16, it's 1 additional policy count each row. id == 5
, start_date
differ more 3 months, there shouldn't overlap.
also, rows 5 , 6 both show value of 2 policies closed within 3 previous months before start_date
because id == 3
has 2 policies ending on 2012-11-16.
Comments
Post a Comment