How to parse a column of json string in Pyspark

在用$spark.sql(\ )$从Table读入数据时,DataFrame的列有时是这样一种类型:json形式的string。此时,我们通常需要去解析这个json string,从而提取我们想要的数据。

数据准备

1# Sample Data Frame
2jstr1 = u'{"header":{"id":12345,"foo":"bar"},"body":{"id":111000,"name":"foobar","sub_json":{"id":54321,"sub_sub_json":{"col1":20,"col2":"somethong"}}}}'
3jstr2 = u'{"header":{"id":12346,"foo":"baz"},"body":{"id":111002,"name":"barfoo","sub_json":{"id":23456,"sub_sub_json":{"col1":30,"col2":"something else"}}}}'
4jstr3 = u'{"header":{"id":43256,"foo":"foobaz"},"body":{"id":20192,"name":"bazbar","sub_json":{"id":39283,"sub_sub_json":{"col1":50,"col2":"another thing"}}}}'
5df = spark.createDataFrame([Row(json=jstr1),Row(json=jstr2),Row(json=jstr3)])
1+--------------------+
2|                json|
3+--------------------+
4|{"header":{"id":1...|
5|{"header":{"id":1...|
6|{"header":{"id":4...|
7+--------------------+

如上所示,我们模拟一个DataFrame,其中只有一列,列名为json,类型为string。可以看到,json中的值为json格式。我们如何从中取出我们关心的值,形成一个单独的列呢?例如:df['header']['id'].

from_json函数

1from pyspark import Row
2from pyspark.sql.functions import from_json, col
3
4json_schema = spark.read.json(df.select('json').rdd.map(lambda row: row.json)).schema
5df_json = df.withColumn('json', from_json(col('json'), json_schema))
6print(json_schema)

$Result:$

 1root
 2 |-- body: struct (nullable = true)
 3 |    |-- id: long (nullable = true)
 4 |    |-- name: string (nullable = true)
 5 |    |-- sub_json: struct (nullable = true)
 6 |    |    |-- id: long (nullable = true)
 7 |    |    |-- sub_sub_json: struct (nullable = true)
 8 |    |    |    |-- col1: long (nullable = true)
 9 |    |    |    |-- col2: string (nullable = true)
10 |-- header: struct (nullable = true)
11 |    |-- foo: string (nullable = true)
12 |    |-- id: long (nullable = true)
1df_json.select(col('json.header.id').alias('id')).show()

$Result:$

1+-----+
2|   id|
3+-----+
4|12345|
5|12346|
6|43256|
7+-----+
1df_json.select(col('json.header.id').alias('id'), col('json.body.name').alias('name')).show()

$Result:$

1+-----+-------+
2|   id|   name|
3+-----+-------+
4|12345| foobar|
5|12346| barfoo|
6|43256| bazbar|
7+-----+-------+

参考链接