Spark解析DataFrame中的json字段
How to parse a column of json string in Pyspark
在用$spark.sql(\ )$从Table读入数据时,
DataFrame
的列有时是这样一种类型:json
形式的string
。此时,我们通常需要去解析这个json string,从而提取我们想要的数据。
数据准备
1# Sample Data Frame
2jstr1 = u'{"header":{"id":12345,"foo":"bar"},"body":{"id":111000,"name":"foobar","sub_json":{"id":54321,"sub_sub_json":{"col1":20,"col2":"somethong"}}}}'
3jstr2 = u'{"header":{"id":12346,"foo":"baz"},"body":{"id":111002,"name":"barfoo","sub_json":{"id":23456,"sub_sub_json":{"col1":30,"col2":"something else"}}}}'
4jstr3 = u'{"header":{"id":43256,"foo":"foobaz"},"body":{"id":20192,"name":"bazbar","sub_json":{"id":39283,"sub_sub_json":{"col1":50,"col2":"another thing"}}}}'
5df = spark.createDataFrame([Row(json=jstr1),Row(json=jstr2),Row(json=jstr3)])
1+--------------------+
2| json|
3+--------------------+
4|{"header":{"id":1...|
5|{"header":{"id":1...|
6|{"header":{"id":4...|
7+--------------------+
如上所示,我们模拟一个DataFrame,其中只有一列,列名为json
,类型为string
。可以看到,json
中的值为json格式。我们如何从中取出我们关心的值,形成一个单独的列呢?例如:df['header']['id']
.
from_json函数
1from pyspark import Row
2from pyspark.sql.functions import from_json, col
3
4json_schema = spark.read.json(df.select('json').rdd.map(lambda row: row.json)).schema
5df_json = df.withColumn('json', from_json(col('json'), json_schema))
6print(json_schema)
$Result:$
1root
2 |-- body: struct (nullable = true)
3 | |-- id: long (nullable = true)
4 | |-- name: string (nullable = true)
5 | |-- sub_json: struct (nullable = true)
6 | | |-- id: long (nullable = true)
7 | | |-- sub_sub_json: struct (nullable = true)
8 | | | |-- col1: long (nullable = true)
9 | | | |-- col2: string (nullable = true)
10 |-- header: struct (nullable = true)
11 | |-- foo: string (nullable = true)
12 | |-- id: long (nullable = true)
1df_json.select(col('json.header.id').alias('id')).show()
$Result:$
1+-----+
2| id|
3+-----+
4|12345|
5|12346|
6|43256|
7+-----+
1df_json.select(col('json.header.id').alias('id'), col('json.body.name').alias('name')).show()
$Result:$
1+-----+-------+
2| id| name|
3+-----+-------+
4|12345| foobar|
5|12346| barfoo|
6|43256| bazbar|
7+-----+-------+