"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without a schema

I'm new to this, so please bear with me.

I am writing a script that sends a dataframe to BigQuery:

    load_job = bq_client.load_table_from_dataframe(
        df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
    )

    # Wait for the load job to complete
    return load_job.result()

This works fine, but only if the schema is already defined in BigQuery, or if I define the schema for the load job in my script (see the sketch at the end of this question). If no schema is defined, I get the following error:

    Traceback (most recent call last):
      File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1661, in load_table_from_dataframe
        dataframe.to_parquet(tmppath, compression=parquet_compression)
      File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet
        **kwargs
      File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet
        **kwargs
      File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write
        **kwargs
      File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 1270, in write_table
        writer.write_table(table, row_group_size=row_group_size)
      File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 426, in write_table
        self.writer.write_table(table, row_group_size=row_group_size)
      File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table
      File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1578661876547574000

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
        _function_handler.invoke_user_function(event_object)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
        return call_user_function(request_or_event)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
        event_context.Context(**request_or_event.context))
      File "/user_code/main.py", line 151, in main
        df = df(param1, param2)
      File "/user_code/main.py", line 141, in get_df
        df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
      File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1677, in load_table_from_dataframe
        os.remove(tmppath)
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp_ps5xji9_job_634ff274.parquet'

Why does pyarrow raise this error, and how can I fix it other than by predefining a schema?
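For reference, "defining the schema for the load job in my script" means passing an explicit schema through a job config. A minimal sketch of that workaround, with a hypothetical field name:

    from google.cloud import bigquery

    # Declaring the column as TIMESTAMP lets the client coerce the pandas
    # nanosecond values instead of failing during Parquet serialization.
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField("created_at", "TIMESTAMP")]
    )
    load_job = bq_client.load_table_from_dataframe(
        df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]), job_config=job_config
    )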

Originally posted by Simon Breton under the CC BY-SA 4.0 license.

2 Answers

The default behavior when converting from pandas to Arrow or Parquet is to not allow silent data loss. There are options that can be set when performing the conversion to allow unsafe casts, which cause a loss of timestamp precision or other forms of data loss. The BigQuery Python API needs to set these options, so this is probably a bug in the BigQuery library. I would suggest reporting it on their issue tracker: https://github.com/googleapis/google-cloud-python
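For illustration only (this is not part of the original answer), here is a minimal sketch of such an unsafe cast in pyarrow, using a made-up column name ts:

    import pandas as pd
    import pyarrow as pa

    # Hypothetical dataframe with a nanosecond-precision timestamp column.
    df = pd.DataFrame({"ts": pd.to_datetime(["2020-01-10 12:31:16.547574111"])})

    # Convert to Arrow, then cast the table to microsecond precision.
    # safe=False permits the lossy truncation that pyarrow refuses by default.
    table = pa.Table.from_pandas(df, preserve_index=False)
    table = table.cast(pa.schema([pa.field("ts", pa.timestamp("us"))]), safe=False)

You could then write this table to Parquet yourself (pyarrow.parquet.write_table) and load the file with the client's load_table_from_file instead of load_table_from_dataframe.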

Originally posted by Wes McKinney under the CC BY-SA 4.0 license.

I ran into the same error.

When I inspected the dataframe, I saw values like this: 2021-09-30 23:59:59.999999998

Your date field probably doesn't match what BigQuery expects by default. I then used this code:

    df['date_column'] = df['date_column'].astype('datetime64[s]')

and the problem was solved.
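If you need to keep sub-second precision, a gentler variant (again a sketch, not from the original answer; the column name is hypothetical) is to floor the timestamps to microseconds, which Parquet and BigQuery can represent losslessly:

    import pandas as pd

    # Drop only the nanosecond component; microsecond values survive the
    # Parquet cast, so no ArrowInvalid is raised.
    df['date_column'] = df['date_column'].dt.floor('us')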

Originally posted by Clegane under the CC BY-SA 4.0 license.
