dplyr left_join 小于,大于条件

新手上路,请多包涵

这个问题与 在非平凡标准上有效合并两个数据框检查日期是否在 r 中的两个日期之间 的问题有些相关。我在这里发布的一个请求该功能是否存在: GitHub问题

我希望使用 dplyr::left_join() 加入两个数据帧。我用来加入的条件是小于,大于,即 <=>dplyr::left_join() 是否支持此功能?或者键只在它们之间使用 = 运算符。这很容易从 SQL 运行(假设我在数据库中有数据框)

这是一个 MWE:我有两个数据集,一年一次( fdata ),而第二个是每五年发生一次的调查数据。因此,对于在两个调查年份之间的 fdata 中的所有年份,我加入了相应的调查年份数据。

 id <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,
        5,5,5,5,
        8,8,8,8,
        13,13,13)

fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
       1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
       1998,1999,2000,2001,1998,1999,2000)

byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)

sdata <- tbl_df(data.frame(byear, eyear, val))

fdata <- tbl_df(data.frame(id, fyear))

test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

我明白了

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds

除非 left_join 可以处理这种情况,但我的语法缺少什么?

原文由 rajvijay 发布,翻译遵循 CC BY-SA 4.0 许可协议

阅读 738
2 个回答

使用 filter 。 (但请注意,此答案 不会 产生正确的 LEFT JOIN ;但 MWE 会给出正确的结果,而是使用 INNER JOIN 。)

dplyr 包如果被要求合并两个表而没有要合并的东西会不高兴,所以在下面,我为此目的在两个表中创建一个虚拟变量,然后过滤,然后删除 dummy :

 fdata %>%
    mutate(dummy=TRUE) %>%
    left_join(sdata %>% mutate(dummy=TRUE)) %>%
    filter(fyear >= byear, fyear < eyear) %>%
    select(-dummy)

请注意,如果您在 PostgreSQL 中执行此操作(例如),查询优化器会通过 dummy 变量看到以下两个查询说明:

 > fdata %>%
+     mutate(dummy=TRUE) %>%
+     left_join(sdata %>% mutate(dummy=TRUE)) %>%
+     filter(fyear >= byear, fyear < eyear) %>%
+     select(-dummy) %>%
+     explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"

LEFT JOIN

(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"

USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"

<PLAN>
Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

并且使用 SQL 更干净地执行此操作会得到 完全相同 的结果:

 > tbl(pg, sql("
+     SELECT *
+     FROM fdata
+     LEFT JOIN sdata
+     ON fyear >= byear AND fyear < eyear")) %>%
+     explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
    SELECT *
    FROM fdata
    LEFT JOIN sdata
    ON fyear >= byear AND fyear < eyear) AS "zzz140"

<PLAN>
Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

原文由 Ian Gow 发布,翻译遵循 CC BY-SA 3.0 许可协议

data.table 从 v 1.9.8 开始添加非 equi 连接

library(data.table) #v>=1.9.8
setDT(sdata); setDT(fdata) # converting to data.table in place

fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
      .(id, x.fyear, byear, eyear, val)]
#    id x.fyear byear eyear val
# 1:  1    1998  1995  2000   1
# 2:  2    1998  1995  2000   1
# 3:  3    1998  1995  2000   1
# 4:  5    1998  1995  2000   1
# 5:  8    1998  1995  2000   1
# 6: 13    1998  1995  2000   1
# 7:  1    1999  1995  2000   1
# 8:  2    1999  1995  2000   1
# 9:  3    1999  1995  2000   1
#10:  5    1999  1995  2000   1
#11:  8    1999  1995  2000   1
#12: 13    1999  1995  2000   1
#13:  1    2000  2000  2005   5
#14:  2    2000  2000  2005   5
#15:  3    2000  2000  2005   5
#16:  5    2000  2000  2005   5
#17:  8    2000  2000  2005   5
#18: 13    2000  2000  2005   5
#19:  1    2001  2000  2005   5
#20:  2    2001  2000  2005   5
#21:  3    2001  2000  2005   5
#22:  5    2001  2000  2005   5
#23:  8    2001  2000  2005   5
#24:  2    2002  2000  2005   5
#25:  3    2002  2000  2005   5
#26:  2    2003  2000  2005   5
#27:  3    2003  2000  2005   5
#    id x.fyear byear eyear val

您还可以在 1.9.6 中使用 foverlaps 稍加努力。

原文由 eddi 发布,翻译遵循 CC BY-SA 3.0 许可协议

撰写回答
你尚未登录,登录后可以
  • 和开发者交流问题的细节
  • 关注并接收问题和回答的更新提醒
  • 参与内容的编辑和改进,让解决方法与时俱进
logo
Stack Overflow 翻译
子站问答
访问
宣传栏