Pandas merge on timestamps that don't match exactly

I'm new to this, please bear with me.

What methods are available to merge on columns whose timestamps don't match exactly?

DF1:

 date        start_time           employee_id  session_id
 01/01/2016  01/01/2016 06:03:13  7261824      871631182

DF2:

 date        start_time           employee_id  session_id
 01/01/2016  01/01/2016 06:03:37  7261824      871631182

I can join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has multiple identical sessions on the same date, which creates duplicates. I could drop the rows where that happens, but then I would lose valid sessions.

Is there an efficient way to join when the timestamp in DF1 is within 5 minutes of the timestamp in DF2, and the session_id and employee_id also match? If there is a matching record, its timestamp will always be slightly later than DF1's, because the event is triggered at some point afterwards.

 ['employee_id', 'session_id', 'timestamp<5minutes']

Edit - I assume someone has run into this before.

I was thinking of doing this:

  1. Take my timestamp in each dataframe
  2. Create a column with the timestamp + 5 minutes (rounded)
  3. Create a column with the timestamp - 5 minutes (rounded)
  4. Create a 10-minute interval string to join the files on
   df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
   df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
   df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)


Does anyone know how to round these timestamps to the nearest 5-minute mark?

02:59:37 - 5 minutes = 02:55:00

02:59:37 + 5 minutes = 03:05:00

interval_string = '02:55:00-03:05:00'

pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])

Does anyone know how to round times like this? This seems workable: you still match on date, employee, and session, and then you look for times that fall within essentially the same 10-minute interval or range.
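
For the rounding step itself, one option (not part of the original post, just a sketch that assumes start_time is already a datetime column) is pandas' Series.dt.round with a '5min' frequency:

import pandas as pd

# hypothetical example, using the timestamp from above
s = pd.Series(pd.to_datetime(['2016-01-01 02:59:37']))

# shift by 5 minutes, then round to the nearest 5-minute mark
low = (s - pd.Timedelta(minutes=5)).dt.round('5min')    # 02:55:00
high = (s + pd.Timedelta(minutes=5)).dt.round('5min')   # 03:05:00

interval_string = low.dt.strftime('%H:%M:%S') + '-' + high.dt.strftime('%H:%M:%S')
print(interval_string.iloc[0])   # '02:55:00-03:05:00'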

Originally written by trench; translated under the CC BY-SA 4.0 license.

2 Answers

Consider a mini version of your problem:

 from io import StringIO
from pandas import read_csv, to_datetime

# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)

This gives

>>> df1
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:03:00      7261824   871631183
2 2016-01-01 11:01:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
>>> df2
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:10:00      7261824   871631185

You would like to treat df2[0:3] as duplicates of df1[0:3] when merging (since they are respectively less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.

Option 1: interval matching

This is basically what you suggested in your edit. You want to map the timestamps in both tables to 10-minute intervals centered on the timestamp, rounded to the nearest 5 minutes.

Each interval can be represented uniquely by its midpoint, so you can merge the dataframes on the timestamp rounded to the nearest 5 minutes. For example:

 import numpy as np

# rounding interval in nanoseconds (5 minutes, i.e. half the 10-minute window)
threshold_ns = threshold * 60 * 1e9

# compute "interval" to which each session belongs
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

# join
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

             interval  employee_id  session_id
0 2016-01-01 02:05:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:00:00      7261824   871631184
3 2016-01-01 14:00:00      7261824   871631185
4 2016-01-01 11:05:00      7261824   871631184
5 2016-01-01 14:10:00      7261824   871631185

Note that this is not entirely correct. Sessions df1[2] and df2[2] are not treated as duplicates even though they are only 3 minutes apart, because they fall on opposite sides of an interval boundary.
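
To see this boundary effect concretely (a small check, not from the original answer):

import pandas as pd

# df1[2] and df2[2] are only 3 minutes apart, but round to different 5-minute marks
t1 = pd.Timestamp('2016-01-01 11:01:00').round('5min')   # 2016-01-01 11:00:00
t2 = pd.Timestamp('2016-01-01 11:04:00').round('5min')   # 2016-01-01 11:05:00
print(t1, t2, t1 == t2)   # different intervals, so the outer merge keeps both rows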

Option 2: one-to-one matching

Here is an alternative approach, which depends on sessions in df1 having zero or one duplicate in df2.

We replace timestamps in df1 with the closest timestamp in df2 which matches on employee_id and session_id and is less than 5 minutes away.

 from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's below the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]

    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)

# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
4 2016-01-01 14:10:00      7261824   871631185

This approach is much slower, since you have to search through all of df2 for every row in df1. What I wrote could probably be optimized further, but it will still take a long time on a large dataset.
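
If the runtime becomes a problem, one possible speed-up (not in the original answer, just a sketch reusing the variable names from the snippet above) is to group df2 by the join keys once, so each lookup only scans that employee's sessions instead of the whole frame:

# build the lookup groups once instead of filtering df2 on every call
groups = df2.groupby(['employee_id', 'session_id'])

def closest_grouped(row):
    try:
        matches = groups.get_group((row.employee_id, row.session_id))
    except KeyError:                 # no rows in df2 for this employee/session
        return row

    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:               # all candidates are outside the threshold
        return row

df1 = df1.apply(closest_grouped, axis=1)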

Originally written by Igor Raush; translated under the CC BY-SA 3.0 license.

I would try using this pandas method:

pandas.merge_asof()

The arguments you will be interested in are tolerance, direction, left_on, and right_on.

Building on @Igor's answer:

 import pandas as pd
from pandas import read_csv
from io import StringIO

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)

df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])

# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)

Output

date_start_time      date_start_time_x    employee_id_x  session_id_x  date_start_time_y    employee_id_y  session_id_y
2016-01-01 02:03:00  2016-01-01 02:03:00  7261824        871631182     2016-01-01 02:03:00  7261824.0      871631182.0
2016-01-01 06:03:00  2016-01-01 06:03:00  7261824        871631183     2016-01-01 06:05:00  7261824.0      871631183.0
2016-01-01 11:01:00  2016-01-01 11:01:00  7261824        871631184     2016-01-01 11:04:00  7261824.0      871631184.0
2016-01-01 14:01:00  2016-01-01 14:01:00  7261824        871631185     NaT                  NaN            NaN
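
Note that this call matches on the timestamp index only. If employee_id and session_id also have to match exactly, merge_asof accepts a by= argument; here is a sketch (not from the original answer), assuming the df1 and df2 built above:

left = df1.reset_index(drop=True).sort_values('date_start_time')
right = df2.reset_index(drop=True).sort_values('date_start_time')

# exact match on employee_id/session_id, approximate match on the timestamp
merged = pd.merge_asof(
    left, right,
    on='date_start_time',
    by=['employee_id', 'session_id'],
    direction='nearest',
    tolerance=pd.Timedelta('5 minute'),
)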

Originally written by jcp; translated under the CC BY-SA 4.0 license.
