r - Calculate number of distinct instances occurring in a given time period -


i have dummy data

structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7,  7), policy_num = c(41551662l, 50966414l, 43077202l, 46927463l,  57130236l, 57050065l, 26196559l, 33545119l, 52304024l, 73953064l,  50340507l, 50491162l, 76577511l, 108067534l), product = c("apple",  "apple", "pear", "apple", "apple", "apple", "plum", "apple",  "pear", "apple", "apple", "apple", "pear", "pear"), start_date =  structure(c(13607, 15434, 14276, 15294, 15660, 15660, 10547, 15117, 15483,  16351, 15429, 15421, 16474, 17205), class = "date"), end_date = structure(c(15068,  16164, 17563, 15660, 15660, 16390, 13834, 16234, 17674, 17447,  15794, 15786, 17205, 17570), class = "date")), .names = c("id",  "policy_num", "product", "start_date", "end_date"), row.names = c(na,  -14l), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)   id policy_num product start_date   end_date  1   41551662   apple 2007-04-04 2011-04-04  1   50966414   apple 2012-04-04 2014-04-04  2   43077202    pear 2009-02-01 2018-02-01  3   46927463   apple 2011-11-16 2012-11-16  3   57130236   apple 2012-11-16 2012-11-16  3   57050065   apple 2012-11-16 2014-11-16  4   26196559    plum 1998-11-17 2007-11-17  5   33545119   apple 2011-05-23 2014-06-13  5   52304024    pear 2012-05-23 2018-05-23  5   73953064   apple 2014-10-08 2017-10-08  6   50340507   apple 2012-03-30 2013-03-30  7   50491162   apple 2012-03-22 2013-03-22  7   76577511    pear 2015-02-08 2017-02-08  7  108067534    pear 2017-02-08 2018-02-08 

based on it, i'd calculate following variables (grouped user_id):

1) number of held product (no_prod_now) - number of distinct products, end_date > evaluated start_date. simply, number of products held user_id @ time of start_date

2) number of held active policies (no_policies_now) - above, applied policy_num

3) number of policies opened within 3 months prior current start_date (policies_open_3mo)

4) policies_closed_3mo - above, number of closed policies in past 3 months

the desirable output this:

 id policy_num product start_date   end_date no_prod_now no_policies_now policies_closed_3mo   1   41551662   apple 2007-04-04 2011-04-04           1               1                   0   1   50966414   apple 2012-04-04 2014-04-04           1               1                   0   2   43077202    pear 2009-02-01 2018-02-01           1               1                   0   3   46927463   apple 2011-11-16 2012-11-16           1               1                   0   3   57130236   apple 2012-11-16 2012-11-16           1               1                   1   3   57050065   apple 2012-11-16 2014-11-16           1               1                   2   4   26196559    plum 1998-11-17 2007-11-17           1               1                   0   5   33545119   apple 2011-05-23 2014-06-13           1               1                   0   5   52304024    pear 2012-05-23 2018-05-23           2               2                   0   5   73953064   apple 2014-10-08 2017-10-08           2               2                   0   6   50340507   apple 2012-03-30 2013-03-30           1               1                   0   7   50491162   apple 2012-03-22 2013-03-22           1               1                   0   7   76577511    pear 2015-02-08 2017-02-08           1               1                   0   7  108067534    pear 2017-02-08 2018-02-08           1               1                   1 policies_open_3mo                 0                 0                 0                 0                 0                 1                 0                 0                 1                 0                 0                 0                 0                 0 

i'm looking solution implemented ideally in data.table, i'm going apply big data volumes, base r or dplyr solutions convert data.table, o valuable, thanks!

this quite tricky can solved number of non-equi self-joins.

edit: has turned out update on join doesn't work non-equi self-joins had expected (see here). so, had revise code avoid updates in place.

instead, 4 additional columns created 3 separate non-equi self-joins , combined final result.

library(data.table) library(lubridate)  result <-    # create helper column previous 3 months periods.   # lubridate's month arithmetic avoids nas @ end of month, e.g., february   dt[, start_date_3mo := start_date %m-% period(month = 3l)][     # start "cbind()" original columns   , c(.sd,        # count number of products , policies held @ time of start_date        dt[dt, on = c("id", "start_date<=start_date", "end_date>start_date"),           .(no_prod_now = uniquen(product), no_pols_now = uniquen(policy_num)),           = .eachi][, c("no_prod_now", "no_pols_now")],       # policies closed within previous 3 months of start_date       dt[dt, on = c("id", "end_date>=start_date_3mo", "end_date<=start_date"),           .(pols_closed_3mo = .n), = .eachi][, "pols_closed_3mo"],       # additional policies opened within previous 3 months of start_date       dt[dt, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"),           .(pols_opened_3mo = .n - 1l), = .eachi][, "pols_opened_3mo"])][            # omit helper column            , -"start_date_3mo"] result 
    id policy_num product start_date   end_date no_prod_now no_pols_now pols_closed_3mo pols_opened_3mo  1:  1   41551662   apple 2007-04-04 2011-04-04           1           1               0               0  2:  1   50966414   apple 2012-04-04 2014-04-04           1           1               0               0  3:  2   43077202    pear 2009-02-01 2018-02-01           1           1               0               0  4:  3   46927463   apple 2011-11-16 2012-11-16           1           1               0               0  5:  3   57130236   apple 2012-11-16 2012-11-16           1           1               2               1  6:  3   57050065   apple 2012-11-16 2014-11-16           1           1               2               1  7:  4   26196559    plum 1998-11-17 2007-11-17           1           1               0               0  8:  5   33545119   apple 2011-05-23 2014-06-13           1           1               0               0  9:  5   52304024    pear 2012-05-23 2018-05-23           2           2               0               0 10:  5   73953064   apple 2014-10-08 2017-10-08           2           2               0               0 11:  6   50340507   apple 2012-03-30 2013-03-30           1           1               0               0 12:  7   50491162   apple 2012-03-22 2013-03-22           1           1               0               0 13:  7   76577511    pear 2015-02-08 2017-02-08           1           1               0               0 14:  7  108067534    pear 2017-02-08 2018-02-08           1           1               1               0 

note there discrepancies policies opened within 3 previous months before start_date between op's expected result , result here. id == 3, there 2 policies starting both on 2012-11-16, it's 1 additional policy count each row. id == 5, start_date differ more 3 months, there shouldn't overlap.

also, rows 5 , 6 both show value of 2 policies closed within 3 previous months before start_date because id == 3 has 2 policies ending on 2012-11-16.


Comments

Popular posts from this blog

Is there a better way to structure post methods in Class Based Views -

performance - Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? -

jquery - Responsive Navbar with Sub Navbar -