─➤ pip show duckdb
Name: duckdb
Version: 1.1.1
Summary: DuckDB in-process database
Home-page: https://www.duckdb.org
Author:
Author-email:
License: MIT
Location: /home/pon/.local/share/virtualenvs/taisan_console-0970Xq3e/lib/python3.11/site-packages
Requires:
Required-by:
使用下面的代码读取一个 csv 文件,但是会报错
excel_file_path = 'dev/周度产出(9.16-9.22).csv'
query = f"""
SELECT *
FROM '{excel_file_path}'
"""
df = duckdb.query(query).df()
看起来是因为类型的问题
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/pon/code/work/pon/pon-it/taisan_console/dev/found_ma_from_csv_batch.py", line 21, in <module>
df = duckdb.query(query).df()
^^^^^^^^^^^^^^^^^^^^^^^^
duckdb.duckdb.ConversionException: Conversion Error: CSV Error on Line: 25716
Original Line:
2024/9/16-2024/9/22,BM-BQBH-2023-3,央视国际2023年新媒体视频版权监测服务协议,新浪新闻,https://k.sina.cn/article_2110705772_m7dced06c033016q3c.html,萧敬腾郁可唯把李清照的词唱成歌,微博电视,-,-,-,已下线,2024-09-19,系统,中央广播电视总台2024中秋晚会,系统,-
Error when converting column "监测日期". Could not convert string "-" to 'DATE'
Column 监测日期 is being converted as type DATE
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'监测日期': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
file=dev/周度产出(9.16-9.22).csv
delimiter = , (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \r\n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
comment = \0 (Auto-Detected)
date_format = (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding=0
sample_size=20480
ignore_errors=false
all_varchar=0
所以如何在读取 csv 的时候指定类型?
问了一个 ai,答案都是不行的
chatgpt ❌
query = f"SELECT * FROM '{excel_file_path}' WITH (types={{'监测日期': 'VARCHAR'}})"
df = duckdb.query(query).df()
智谱清言 ❌
query = f"""
SELECT *
FROM '{excel_file_path}'
(监测日期 VARCHAR)
"""
df = duckdb.query(query).df()
豆包 ❌
import duckdb
# 指定列的类型映射
types = {'column_name': 'data_type'}
# 读取 CSV 文件并指定列类型
query = f"SELECT * FROM '{csv_file_path}' (types={types})"
df = duckdb.query(query).df()
https://github.com/duckdb/duckdb/discussions/14131
改成下面这样可以