Python - For loop over millions of rows

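I have a dataframe c with many different columns. Also, arr is a dataframe that corresponds to a subset of c: arr = c[c['A_D'] == 'A'].

The main idea of my code is to iterate over all rows in the c dataframe and, for each one, search the arr dataframe for all possible matches that satisfy some specific conditions:

It is only necessary to iterate over rows where c['A_D'] == 'D' and c['Already_linked'] == 0
The hour in the arr dataframe must be less than the hour_aux in the c dataframe
The column Already_linked of the arr dataframe must be zero: arr.Already_linked == 0
The Terminal and the Operator need to be the same in the c and arr dataframes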
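
To make the four conditions concrete, here is a minimal sketch that writes them as Boolean masks for a single departure row. The toy values are made up for illustration and only the columns needed for the conditions are included; this is not the asker's full dataset.

```python
import datetime as dt
import pandas as pd

# Toy stand-ins for c and arr; the values are invented for illustration.
c = pd.DataFrame({
    "A_D": ["D", "D", "A", "A"],
    "Operator": ["DL", "VS", "DL", "DL"],
    "Terminal": ["3", "3", "3", "3"],
    "Already_linked": [0, 0, 0, 1],
    "hour": [dt.time(9, 30), dt.time(15, 15), dt.time(6, 50), dt.time(7, 45)],
    "hour_aux": [dt.time(9, 15), dt.time(15, 0), dt.time(6, 35), dt.time(7, 30)],
})
arr = c[c["A_D"] == "A"].copy()

# Condition 1: only departures that are not linked yet are iterated.
departures = c[(c["A_D"] == "D") & (c["Already_linked"] == 0)]

# Conditions 2 to 4, evaluated for the first such departure.
row = departures.iloc[0]
candidates = arr[
    (arr["hour"] < row["hour_aux"])            # arrival hour before hour_aux
    & (arr["Already_linked"] == 0)             # arrival still free
    & (arr["Operator"] == row["Operator"])     # same operator
    & (arr["Terminal"] == row["Terminal"])     # same terminal
]
```

Here only the arrival at index 2 survives: the arrival at index 3 matches on operator, terminal and hour but is already linked.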

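Right now, the conditions are applied using both Boolean indexing and groupby get_group:

Group the arr dataframe in order to select the same Operator and Terminal: g = groups.get_group((row.Operator, row.Terminal))
Keep only the arrivals where the hour is smaller than the hour_aux in the c dataframe and where Already_linked == 0: vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]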
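
One detail of get_group is worth spelling out: it raises a KeyError when no group exists for the requested key, which is one reason the loop needs a try/except. A small sketch with made-up rows:

```python
import pandas as pd

# Toy arrivals; the values are invented for illustration.
arr = pd.DataFrame({
    "Operator": ["DL", "DL", "QR"],
    "Terminal": ["3", "3", "4"],
    "FlightID": ["DL401", "DL402", "QR001"],
})
groups = arr.groupby(["Operator", "Terminal"])

# An existing (Operator, Terminal) pair returns the matching rows.
g = groups.get_group(("DL", "3"))

# A pair with no arrivals at all raises KeyError instead of returning an empty frame.
try:
    groups.get_group(("VS", "3"))
    found = True
except KeyError:
    found = False
```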

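For each row in the c dataframe that satisfies all conditions, a vb dataframe is created. Naturally, this dataframe has a different length in each iteration. After creating the vb dataframe, my goal is to choose the index of the vb dataframe that minimizes the time between vb.START and c['x']. The FlightID that corresponds to this index is then stored in column a of the c dataframe. Additionally, since the arrival is now linked to a departure, the column Already_linked in the arr dataframe is changed from 0 to 1.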
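
The selection step can be sketched in isolation. The vb frame and the x timestamp below are hand-built from three of the sample arrivals and the first DL departure (09:30 minus 84 minutes), purely for illustration:

```python
import pandas as pd

# Candidate arrivals for one departure, keeping their original index labels.
vb = pd.DataFrame(
    {
        "START": pd.to_datetime(
            ["2017-03-26 06:50", "2017-03-26 07:45", "2017-03-27 06:50"]
        ),
        "FlightID": ["DL401", "DL402", "DL401"],
    },
    index=[4, 9, 5],
)
x = pd.Timestamp("2017-03-26 08:06")  # departure START minus tot minutes

# idxmin returns the index label (not the position) of the smallest gap.
aux = (vb["START"] - x).abs().idxmin()
flight = vb.loc[aux, "FlightID"]
```

The gaps are 76 minutes, 21 minutes and roughly 23 hours, so aux is the label 9 and flight is DL402.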

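It is important to note that the column Already_linked of the arr dataframe may change in every iteration (and arr.Already_linked == 0 is one of the conditions used to create the vb dataframe). Therefore, it is not possible to parallelize this code.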
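
A tiny pure-Python sketch shows why the order matters. The helper greedy_link and the times (minutes after midnight) are hypothetical; the point is only that two departures competing for the same arrival get different links depending on which one is processed first.

```python
def greedy_link(departures, arrivals):
    """Each departure, in order, takes the nearest arrival that is still free."""
    free = set(range(len(arrivals)))
    links = {}
    for name, dep_time in departures:
        if not free:
            links[name] = None
            continue
        best = min(free, key=lambda i: abs(arrivals[i] - dep_time))
        links[name] = best
        free.remove(best)  # plays the role of Already_linked = 1
    return links


arrivals = [410, 465]   # 06:50 and 07:45
d1 = ("D1", 480)        # nearest arrival is 465
d2 = ("D2", 470)        # nearest arrival is also 465

first = greedy_link([d1, d2], arrivals)
second = greedy_link([d2, d1], arrivals)
```

Whichever departure runs first claims arrival 1, and the other one falls back to arrival 0, so the rows cannot be evaluated independently of each other.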

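I have already used c.itertuples() for efficiency, but since c has millions of rows, this code is still too time consuming.

Another option would be to use pd.apply on every row. Nonetheless, this is not really straightforward, since in each iteration values change in both c and arr (also, I believe that even with pd.apply it would be extremely slow).

Is there any possible way to convert this for loop into a vectorized solution (or to decrease the running time by a factor of 10, or even more if possible)?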

Initial dataframe:

    START                END                  A_D  Operator  FlightID  Terminal  TROUND_ID   tot
0   2017-03-26 16:55:00  2017-10-28 16:55:00  A    QR        QR001     4         QR002       70
1   2017-03-26 09:30:00  2017-06-11 09:30:00  D    DL        DL001     3         "        "  84
2   2017-03-27 09:30:00  2017-10-28 09:30:00  D    DL        DL001     3         "        "  78
3   2017-10-08 15:15:00  2017-10-22 15:15:00  D    VS        VS001     3         "        "  45
4   2017-03-26 06:50:00  2017-06-11 06:50:00  A    DL        DL401     3         "        "  9
5   2017-03-27 06:50:00  2017-10-28 06:50:00  A    DL        DL401     3         "        "  19
6   2017-03-29 06:50:00  2017-04-19 06:50:00  A    DL        DL401     3         "        "  3
7   2017-05-03 06:50:00  2017-10-25 06:50:00  A    DL        DL401     3         "        "  32
8   2017-06-25 06:50:00  2017-10-22 06:50:00  A    DL        DL401     3         "        "  95
9   2017-03-26 07:45:00  2017-10-28 07:45:00  A    DL        DL402     3         "        "  58

Desired output (some columns are omitted from the dataframe below; only the a and Already_linked columns are relevant):

    START                END                  A_D  Operator  a              Already_linked
0   2017-03-26 16:55:00  2017-10-28 16:55:00  A    QR        0              1
1   2017-03-26 09:30:00  2017-06-11 09:30:00  D    DL        DL402          1
2   2017-03-27 09:30:00  2017-10-28 09:30:00  D    DL        DL401          1
3   2017-10-08 15:15:00  2017-10-22 15:15:00  D    VS        No_link_found  0
4   2017-03-26 06:50:00  2017-06-11 06:50:00  A    DL        0              0
5   2017-03-27 06:50:00  2017-10-28 06:50:00  A    DL        0              1
6   2017-03-29 06:50:00  2017-04-19 06:50:00  A    DL        0              0
7   2017-05-03 06:50:00  2017-10-25 06:50:00  A    DL        0              0
8   2017-06-25 06:50:00  2017-10-22 06:50:00  A    DL        0              0
9   2017-03-26 07:45:00  2017-10-28 07:45:00  A    DL        0              1

Code:

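groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
    try:
        # Arrivals with the same Operator and Terminal as this departure
        g = groups.get_group((row.Operator, row.Terminal))
        # Keep only unlinked arrivals whose hour is earlier than hour_aux
        vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]
        # Index label of the arrival whose START is closest to row.x
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
        arr.loc[aux, 'Already_linked'] = 1
    except (KeyError, ValueError):
        # KeyError: no arrivals for this (Operator, Terminal); ValueError: vb is empty
        continue

# Mark linked departures, copy the arrival flags back into c, then label unlinked departures
c['Already_linked'] = np.where((c.a != 0) & (c.a != 'No_link_found') & (c.A_D == 'D'), 1, c['Already_linked'])
c.loc[arr.index, 'Already_linked'] = arr['Already_linked']
c['a'] = np.where((c.Already_linked == 0) & (c.A_D == 'D'), 'No_link_found', c['a'])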

Code for the initial c dataframe:

import numpy as np
import pandas as pd
import io

s = '''
 A_D     Operator     FlightID    Terminal   TROUND_ID   tot
 A   QR  QR001   4   QR002       70
 D   DL  DL001   3   "        "  84
 D   DL  DL001   3   "        "  78
 D   VS  VS001   3   "        "  45
 A   DL  DL401   3   "        "  9
 A   DL  DL401   3   "        "  19
 A   DL  DL401   3   "        "  3
 A   DL  DL401   3   "        "  32
 A   DL  DL401   3   "        "  95
 A   DL  DL402   3   "        "  58
'''

# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
data_aux = pd.read_table(io.StringIO(s), sep=r'\s+')
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot = data_aux.tot.astype(str)

d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00','2017-03-27 09:30:00','2017-10-08 15:15:00',
           '2017-03-26 06:50:00','2017-03-27 06:50:00','2017-03-29 06:50:00','2017-05-03 06:50:00',
           '2017-06-25 06:50:00','2017-03-26 07:45:00'], 'END': ['2017-10-28 16:55:00' ,'2017-06-11 09:30:00' ,
           '2017-10-28 09:30:00' ,'2017-10-22 15:15:00','2017-06-11 06:50:00' ,'2017-10-28 06:50:00',
           '2017-04-19 06:50:00' ,'2017-10-25 06:50:00','2017-10-22 06:50:00' ,'2017-10-28 07:45:00']}

aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)
c = pd.concat([aux_df, data_aux], axis = 1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)

c['hour'] = c['START'].dt.time
c['hour_aux'] = (c['START'] - pd.Timedelta(15, unit='m')).dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START -  pd.to_timedelta(c.tot.astype(int), unit='m')
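c["a"] = pd.Series(0, index=c.index, dtype=object)  # object dtype: this column later also holds FlightID strings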
c["Already_linked"] = np.where(c.TROUND_ID != "        ", 1 ,0)

arr = c[c['A_D'] == 'A'].copy()  # explicit copy, so later writes to arr do not target a slice of c

Originally posted by Miguel Lambelho; translation licensed under CC BY-SA 4.0.


def apply_do_g(it_row):
    """
    This is your function, but using isin and apply
    """
    keep = {'Operator': [it_row.Operator], 'Terminal': [it_row.Terminal]}  # dict for isin combined mask

    holder1 = arr[list(keep)].isin(keep).all(axis=1)  # create boolean mask
    holder2 = arr.Already_linked.isin([0])  # create boolean mask
    holder3 = arr.hour < it_row.hour_aux  # create boolean mask

    holder = holder1 & holder2 & holder3  # combine the masks

    holder = arr.loc[holder]

    if not holder.empty:
        aux = np.absolute(holder.START - it_row.x).idxmin()
        c.loc[it_row.name, 'a'] = holder.loc[aux].FlightID  # use with apply 'it_row.name'
        arr.loc[aux, 'Already_linked'] = 1


def new_way_2():
    keep = {'A_D': ['D'], 'Already_linked': [0]}
    df_test = c[c[list(keep)].isin(keep).all(axis=1)].copy()  # returns the resultant df
    df_test.apply(lambda row: apply_do_g(row), axis=1)  # g is multiple DataFrames


# call the function
new_way_2()

for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
    vb = arr[(arr.Already_linked == 0) & (arr.hour < row.hour_aux)].copy().query(row.query_string)
    try:
        aux = (vb.START - row.x).abs().idxmin()
        print(row.x)
        c.loc[row.Index, 'a'] = vb.loc[aux,'FlightID']
        arr.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue
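
A further direction that might be worth trying, offered as a sketch rather than a tested solution for the real data: a link can only happen between rows that share Operator and Terminal, so each group can be processed on its own, and the inner loop can run on NumPy arrays instead of pandas objects. The helper link_by_group below is hypothetical. It assumes the columns of the sample data, recomputes x and hour_aux from START and tot, and keeps the greedy order of the original loop inside each group. It is still a loop and is not vectorized; it only avoids per-row pandas overhead.

```python
import numpy as np
import pandas as pd


def link_by_group(c):
    """Map each linked departure's index label to its arrival's index label."""
    links = {}
    for _, grp in c.groupby(["Operator", "Terminal"], sort=False):
        arrivals = grp[grp["A_D"] == "A"]
        departures = grp[(grp["A_D"] == "D") & (grp["Already_linked"] == 0)]
        if arrivals.empty or departures.empty:
            continue

        # Pull the needed columns out of pandas once per group.
        arr_labels = arrivals.index.to_numpy()
        arr_start = arrivals["START"].to_numpy()
        # Time of day as a timedelta since midnight (plays the role of "hour").
        arr_hour = (arrivals["START"] - arrivals["START"].dt.normalize()).to_numpy()
        free = (arrivals["Already_linked"] == 0).to_numpy().copy()

        minutes = pd.to_timedelta(departures["tot"].astype(int), unit="m")
        dep_x = (departures["START"] - minutes).to_numpy()
        shifted = departures["START"] - pd.Timedelta(minutes=15)
        dep_hour_aux = (shifted - shifted.dt.normalize()).to_numpy()

        for label, x, hour_aux in zip(departures.index, dep_x, dep_hour_aux):
            ok = free & (arr_hour < hour_aux)
            if not ok.any():
                continue
            pos = np.flatnonzero(ok)
            # Nearest free arrival; ties go to the first one, like idxmin.
            best = pos[np.abs(arr_start[pos] - x).argmin()]
            links[label] = arr_labels[best]
            free[best] = False  # same effect as Already_linked = 1
    return links


# The sample rows from the question (only the columns the helper needs).
c = pd.DataFrame({
    "START": pd.to_datetime([
        "2017-03-26 16:55", "2017-03-26 09:30", "2017-03-27 09:30",
        "2017-10-08 15:15", "2017-03-26 06:50", "2017-03-27 06:50",
        "2017-03-29 06:50", "2017-05-03 06:50", "2017-06-25 06:50",
        "2017-03-26 07:45"]),
    "A_D": list("ADDDAAAAAA"),
    "Operator": ["QR", "DL", "DL", "VS", "DL", "DL", "DL", "DL", "DL", "DL"],
    "Terminal": ["4", "3", "3", "3", "3", "3", "3", "3", "3", "3"],
    "FlightID": ["QR001", "DL001", "DL001", "VS001", "DL401", "DL401",
                 "DL401", "DL401", "DL401", "DL402"],
    "tot": [70, 84, 78, 45, 9, 19, 3, 32, 95, 58],
    "Already_linked": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

links = link_by_group(c)
linked_flights = {dep: c.at[arr_label, "FlightID"] for dep, arr_label in links.items()}
```

On the sample rows this links departure 1 to arrival 9 (DL402) and departure 2 to arrival 5 (DL401), and leaves the VS departure unlinked, which agrees with the desired output. Because the groups do not interact, they could in principle also be handed to separate workers, although whether that pays off depends on how the real data is distributed across groups.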
