PySpark: replace all values in a DataFrame with another value

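My PySpark DataFrame has 500 columns. Some are strings, some are ints, and some are booleans (100 boolean columns). All of the boolean columns have two distinct levels, Yes and No, and I want to convert them to 1/0.

For the string columns I have three values: passed, failed, and null. How do I replace those nulls with 0? fillna(0) works only with integers.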

 c1| c2 |    c3 |c4|c5..... |c500
yes| yes|passed |45....
No | Yes|failed |452....
Yes|No  |None   |32............

When I run

df.replace('yes', 1)

I get the following error:

 ValueError: Mixed type replacements are not supported

Originally posted by Emma; translated under the CC BY-SA 4.0 license.


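"For the string columns I have three values: passed, failed, and null. How do I replace those nulls with 0? fillna(0) works only with integers."

First, import when and lit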

 from pyspark.sql.functions import when, lit

Suppose your DataFrame has these columns

# Reconstructing my DataFrame based on your assumptions
# cols are Columns in the DataFrame
cols = ['name', 'age', 'col_with_string']

# Similarly the values
vals = [
     ('James', 18, 'passed'),
     ('Smith', 15, 'passed'),
     ('Albie', 32, 'failed'),
     ('Stacy', 33, None),
     ('Morgan', 11, None),
     ('Dwight', 12, None),
     ('Steve', 16, 'passed'),
     ('Shroud', 22, 'passed'),
     ('Faze', 11,'failed'),
     ('Simple', 13, None)
]

# This will create a DataFrame using 'cols' and 'vals'
# spark is an object of SparkSession
df = spark.createDataFrame(vals, cols)

# We have the following DataFrame
df.show()

+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|           null|
|Morgan| 11|           null|
|Dwight| 12|           null|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|           null|
+------+---+---------------+

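You can use:

withColumn() - specifies the column to work on.
isNull() - a filter that evaluates to true when the attribute evaluates to null.
lit() - creates a column from a literal value.
when(), otherwise() - used to check a condition against a column.

With these I can replace the null values with '0'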

df = df.withColumn('col_with_string', when(df.col_with_string.isNull(),
lit('0')).otherwise(df.col_with_string))

# We have replaced nulls with a '0'
df.show()

+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|              0|
|Morgan| 11|              0|
|Dwight| 12|              0|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|              0|
+------+---+---------------+

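Part 1 of your question: Yes/No boolean values. You mentioned that there are 100 boolean columns. For this, I usually either rebuild the table with the updated values or create a UDF that returns 1 or 0 for Yes or No.

I am adding two more columns, can_vote and can_lotto, to the DataFrame (df)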

df = df.withColumn("can_vote", df.age >= 18)
df = df.withColumn("can_lotto", df.age > 16)

# Updated DataFrame will be
df.show()

+------+---+---------------+--------+---------+
|  name|age|col_with_string|can_vote|can_lotto|
+------+---+---------------+--------+---------+
| James| 18|         passed|    true|     true|
| Smith| 15|         passed|   false|    false|
| Albie| 32|         failed|    true|     true|
| Stacy| 33|              0|    true|     true|
|Morgan| 11|              0|   false|    false|
|Dwight| 12|              0|   false|    false|
| Steve| 16|         passed|   false|    false|
|Shroud| 22|         passed|    true|     true|
|  Faze| 11|         failed|   false|    false|
|Simple| 13|              0|   false|    false|
+------+---+---------------+--------+---------+

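Suppose you have columns similar to can_vote and can_lotto (boolean columns holding Yes/No values)

You can use the following line of code to get the columns in the DataFrame that have boolean type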

col_with_bool = [item[0] for item in df.dtypes if item[1].startswith('boolean')]

This returns a list

['can_vote', 'can_lotto']

You can create a UDF and iterate over each column in such a list, using lit to set each column to 1 (Yes) or 0 (No).

For reference, see the following links

isNull(): https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/sources/IsNull.html
lit, when: https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html

Originally posted by pvy4917; translated under the CC BY-SA 4.0 license.
